Dropping Convexity for Faster Semi-definite Optimization

Srinadh Bhojanapalli, Anastasios Kyrillidis, Sujay Sanghavi


Abstract 

We study the minimization of a convex function $f(X)$ over the set of $n \times n$ positive semi-definite
matrices, but when the problem is recast as $\min_U g(U) := f(UU^\top)$, with $U \in \mathbb{R}^{n \times r}$ and $r \le n$.
We study the performance of gradient descent on $g$, which we refer to as Factored Gradient Descent (Fgd),
under standard assumptions on the original function $f$.

We provide a rule for selecting the step size and, with this choice, show that the local convergence rate
of Fgd mirrors that of standard gradient descent on the original $f$: i.e., after $k$ steps, the error is $O(1/k)$
for smooth $f$, and exponentially small in $k$ when $f$ is (restricted) strongly convex. In addition, we provide
a procedure to initialize Fgd for (restricted) strongly convex objectives and when one only has access to $f$
via a first-order oracle; for several problem instances, such proper initialization leads to global convergence
guarantees.

Fgd and similar procedures are widely used in practice for problems that can be posed as matrix
factorization. To the best of our knowledge, this is the first paper to provide precise convergence rate
guarantees for general convex functions under standard convex assumptions.


1 Introduction 


Consider the following standard convex semi-definite optimization problem:

    $\text{minimize}_{X \in \mathbb{R}^{n \times n}} \; f(X) \quad \text{subject to} \quad X \succeq 0$,    (1)

where $f : \mathbb{R}^{n \times n} \to \mathbb{R}$ is a convex and differentiable function, and $X \succeq 0$ denotes the convex set over
positive semi-definite matrices in $\mathbb{R}^{n \times n}$. Let $X^\star$ be an optimum of (1) with $\text{rank}(X^\star) = r^\star \le n$. This
problem can be remodeled as a non-convex problem, by writing $X = UU^\top$ where $U$ is an $n \times r$ matrix.
Specifically, define $g(U) := f(UU^\top)$ and consider direct optimization of the transformed problem, i.e.,

    $\text{minimize}_{U \in \mathbb{R}^{n \times r}} \; g(U) \quad \text{where } r \le n$.    (2)

Problems (1) and (2) will have the same optimum when $r = r^\star$. However, the recast problem is unconstrained
and leads to computational gains in practice: e.g., iterative update schemes, like gradient descent, do not need
to do eigen-decompositions to satisfy semi-definite constraints at every iteration.

In this paper, we also consider the case of $r < r^\star$, which often occurs in applications. The reasons for such
a choice could be three-fold: (i) it might better model an underlying task (e.g., $f$ may have arisen from a
relaxation of a rank constraint in the first place), (ii) it leads to computational gains, since smaller $r$ means
fewer variables to maintain and optimize, (iii) it leads to statistical "gains", as it might prevent over-fitting in
machine learning or inference problems.

Such recasting of matrix optimization problems is empirically widely popular, especially as the size of problem
instances increases. Some applications in modern machine learning include matrix completion,
affine rank minimization, covariance / inverse covariance selection [38], [50], phase retrieval [11], [18],
Euclidean distance matrix completion [55], finding the square root of a PSD matrix [40], and sparse
PCA, just to name a few. Typically, one can solve (2) via simple, first-order methods on $U$ like gradient
descent. Unfortunately, such procedures have no guarantees on convergence to the optima of the original $f$,
or on the rate thereof. Our goal in this paper is to provide such analytical guarantees, by using, simply and
transparently, standard convexity properties of the original $f$.

Overview of our results. In this paper, we prove that updating $U$ via gradient descent in (2) converges (fast)
to optimal (or near-optimal) solutions. While there are some recent and very interesting works that consider
using such non-convex parametrization [32], [52], their results only apply to specific examples.
To the best of our knowledge, this is the first paper that solves the re-parametrized problem with attractive
convergence rate guarantees for general convex functions $f$ and under common convex assumptions. Moreover,
we achieve the above by assuming the first-order oracle model: for any matrix $X$, we can only obtain the value
$f(X)$ and the gradient $\nabla f(X)$.

To achieve the desiderata, we study how gradient descent over $U$ performs in solving (2). This leads to the
factored gradient descent (Fgd) algorithm, which applies the simple update rule

    $U^+ = U - \eta \nabla f(UU^\top) \cdot U$.    (3)

We provide a set of sufficient conditions to guarantee convergence. We show that, given a suitable initialization
point, Fgd converges to a solution close to the optimal point at a sublinear or linear rate, depending on the
nature of $f$.
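The update rule (3) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`matmul`, `transpose`, `fgd`), the quadratic objective $f(X) = \|X - Y\|_F^2$, the target `Y`, the starting point and the fixed step size `eta = 0.005` are all assumptions chosen for the example.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def fgd(grad_f, U0, eta, num_iters):
    """Iterate U+ = U - eta * grad_f(U U^T) U for a fixed number of steps."""
    U = [row[:] for row in U0]
    for _ in range(num_iters):
        X = matmul(U, transpose(U))      # current PSD iterate X = U U^T
        GU = matmul(grad_f(X), U)        # grad f(X) applied to the factor U
        U = [[u - eta * g for u, g in zip(u_row, g_row)]
             for u_row, g_row in zip(U, GU)]
    return U


# Hypothetical objective: f(X) = ||X - Y||_F^2, so grad f(X) = 2 (X - Y).
Y = [[4.0, 2.0], [2.0, 1.0]]             # rank-1 PSD target, Y = [2, 1][2, 1]^T


def grad_f(X):
    return [[2.0 * (x - y) for x, y in zip(x_row, y_row)]
            for x_row, y_row in zip(X, Y)]


# Assumed starting factor and step size, chosen only for illustration.
U = fgd(grad_f, U0=[[1.5], [0.5]], eta=0.005, num_iters=500)
X_hat = matmul(U, transpose(U))
```

Note that no eigen-decomposition or projection appears in the loop: each iteration only needs the gradient and matrix products, and the iterate $UU^\top$ is PSD by construction.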

Our contributions in this work can be summarized as follows: 

(i) New step size rule and Fgd. Our main algorithmic contribution is a special choice of the step size $\eta$. Our
analysis showcases that $\eta$ needs to depend not only on the convexity parameters of $f$ (as is the case in standard
convex optimization) but also on the top singular value of the unknown optimum. Section 3 describes the
precise step size rule, and also the intuition behind it. Of course, the optimum is not known a priori. As a
solution in practice, we show that choosing $\eta$ based on a point that is at constant relative distance from the
optimum also provably works.

(ii) Convergence of Fgd under common convex assumptions. We consider two cases: (i) when $f$ is just an
$M$-smooth convex function, and (ii) when $f$ also satisfies restricted strong convexity (RSC), i.e., $f$ satisfies
strong-convexity-like conditions, but only over low rank matrices; see next section for definitions. Both cases
are based on now-standard notions, common for the analysis of convex optimization algorithms. Given a
good initial point, we show that, when $f$ is $M$-smooth, Fgd converges sublinearly to an optimal point $X^\star$.
For the case where $f$ has RSC, Fgd converges linearly to the unique $X^\star$, matching the analogous result for
classic gradient descent schemes under smoothness and strong convexity assumptions.

Furthermore, for the case of smooth and strongly convex $f$, our analysis extends to the case $r < r^\star$, where
Fgd converges to a point close to the best rank-$r$ approximation of $X^\star$ (footnote 2).

Both results hold when Fgd is initialized at a point with constant relative distance from the optimum.
Interestingly, the linear convergence rate factor depends not only on the convexity parameters of $f$, but also on
the spectral characteristics of the optimum; a phenomenon borne out in our experiments. Section 4 formally
states these results.

(iii) Initialization: For specific problem settings, various initialization schemes are possible (see [42], [64], [25]). In
this paper, we extend such results to the case where we only have access to $f$ via the first-order oracle:
specifically, we initialize based on the gradient at zero, i.e., $\nabla f(0)$. We show that, for certain condition
numbers of $f$, this yields a constant relative error initialization (Section 5). Moreover, Section 5 lists alternative
procedures that lead to good initialization points and comply with our theory.

Roadmap. The rest of the paper is organized as follows. Section 2 contains basic notation and standard
convex definitions. Section 3 presents the Fgd algorithm and the step size $\eta$ used, along with some intuition for
its selection. Section 4 contains the convergence guarantees of Fgd; the main supporting lemmas and proofs
of the main theorems are provided in Section 6. In Section 5, we discuss some initialization procedures that
guarantee a "decent" starting point for Fgd. This paper concludes with discussion on related work (Section 7).

2 Preliminaries 

Notation. For matrices $X, Y \in \mathbb{R}^{n \times n}$, their inner product is $\langle X, Y \rangle = \text{Tr}(X^\top Y)$. Also, $X \succeq 0$ denotes that $X$ is
a positive semi-definite (PSD) matrix, while the convex set of PSD matrices is denoted $\mathbb{S}^n_+$. We use $\|X\|_F$ and
$\|X\|_2$ for the Frobenius and spectral norms of a matrix, respectively. Given a matrix $X$, we use $\sigma_{\min}(X)$ and
$\sigma_{\max}(X)$ to denote the smallest and largest strictly positive singular values of $X$ and define
$\tau(X) = \sigma_{\max}(X) / \sigma_{\min}(X)$; with a slight abuse of notation, we also use $\sigma_1(X) \equiv \sigma_{\max}(X) \equiv \|X\|_2$.
$X_r$ denotes the rank-$r$ approximation of $X$ via its truncated singular value decomposition. Let
$\tau(X^\star_r) = \sigma_1(X^\star_r) / \sigma_r(X^\star_r)$ denote the condition number of $X^\star_r$; again, observe $\sigma_r(X_r) \equiv \sigma_{\min}(X_r)$.
$Q_A$ denotes the basis of the column space of matrix $A$. $\text{srank}(X) := \|X\|_F^2 / \|X\|_2^2$
represents the stable rank of matrix $X$. We use $e_i \in \mathbb{R}^n$ to denote the standard basis vector with 1 at the $i$-th
position and zeros elsewhere.

2 In this case, we require $\|X^\star - X^\star_r\|_F$ to be small enough, such that the rank-constrained optimum is close to the best rank-$r$
approximation of $X^\star$. This assumption naturally applies in applications where, e.g., $X^\star$ is a superposition of a low rank latent
matrix plus a small perturbation term. In Section I we show how this assumption can be dropped by using a different
step size $\eta$, where spectral norm computation of two $n \times r$ matrices is required per iteration.

Without loss of generality, $f$ is a symmetric convex function, i.e., $f(X) = f(X^\top)$. Let $\nabla f(X)$ denote the
gradient matrix, i.e., its $(i,j)$-th element is $[\nabla f(X)]_{ij} = \partial f(X) / \partial x_{ij}$. For $X = UU^\top$, the gradient of $f$ with respect
to $U$ is $(\nabla f(UU^\top) + \nabla f(UU^\top)^\top) U = 2 \nabla f(X) \cdot U$, due to symmetry of $f$. Finally, let $X^\star$ be the optimum of
$f(X)$ over $\mathbb{S}^n_+$ with factorization $X^\star = U^\star (U^\star)^\top$.

For any general symmetric matrix $X$, let the matrix $\mathcal{P}_+(X)$ be its projection onto the set of PSD matrices.
This can be done by finding all the strictly positive eigenvalues and corresponding eigenvectors $(\lambda_i, v_i : \lambda_i > 0)$
and then forming $\mathcal{P}_+(X) = \sum_{i : \lambda_i > 0} \lambda_i v_i v_i^\top$.

In algorithmic descriptions, $U$ and $U^+$ denote the putative solution of the current and next iteration, respectively.
An important issue in optimizing $f$ over the $U$ space is the existence of non-unique possible factorizations $UU^\top$
for any feasible point $X$. To see this, given a factorization $X = UU^\top$ where $U \in \mathbb{R}^{n \times r}$, one can define a class of
equivalent factorizations $U R^\top R U^\top = UU^\top$, where $R$ belongs to the set $\{R \in \mathbb{R}^{r \times r} : R^\top R = I\}$ of rotational
matrices. So we use a rotation invariant distance metric in the factored space that is equivalent to distance in
the matrix $X$ space, which is defined below.

Definition 2.1. Let matrices $U, V \in \mathbb{R}^{n \times r}$. Define:

    $\text{Dist}(U, V) := \min_{R : R \in \mathcal{O}} \|U - VR\|_F$.

$\mathcal{O}$ is the set of $r \times r$ orthonormal matrices $R$, such that $R^\top R = I_{r \times r}$. The optimal $R$ satisfies $R = PQ^\top$ where
$P \Sigma Q^\top$ is the singular value decomposition of $V^\top U$.
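For intuition, the rank-1 special case of Definition 2.1 can be written out directly. When $r = 1$ the set $\mathcal{O}$ contains only $R = +1$ and $R = -1$, so the minimization reduces to a choice of sign (which agrees with $R = PQ^\top$, since the SVD of the scalar $V^\top U$ yields its sign). The sketch below assumes this special case only; the function name `dist_rank1` is hypothetical, and the general case would require an SVD routine.

```python
import math


def dist_rank1(u, v):
    """Dist(u, v) for n x 1 factors: min over R in {+1, -1} of ||u - v R||_2."""
    minus = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    plus = math.sqrt(sum((a + b) ** 2 for a, b in zip(u, v)))
    return min(minus, plus)


u = [1.0, 2.0]
v = [-1.0, -2.0]                     # v v^T equals u u^T: an equivalent factor
print(dist_rank1(u, v))              # 0.0
print(dist_rank1(u, [1.0, 0.0]))     # 2.0
```

The first call returns zero even though $u \ne v$, which is the invariance the definition is designed to provide.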

Assumptions. We will investigate the performance of non-convex gradient descent for functions $f$ that satisfy
standard smoothness conditions only, as well as the case where $f$ further is (restricted) strongly convex. We
state these standard definitions below.

Definition 2.2. Let $f : \mathbb{S}^n_+ \to \mathbb{R}$ be convex and differentiable. Then, $f$ is $m$-strongly convex if:

    $f(Y) \ge f(X) + \langle \nabla f(X), Y - X \rangle + \frac{m}{2} \|Y - X\|_F^2, \quad \forall X, Y \in \mathbb{S}^n_+$.    (4)

Definition 2.3. Let $f : \mathbb{S}^n_+ \to \mathbb{R}$ be a convex differentiable function. Then, $f$ is $M$-smooth if:

    $\|\nabla f(X) - \nabla f(Y)\|_F \le M \cdot \|X - Y\|_F, \quad X, Y \in \mathbb{S}^n_+$.    (5)

This further implies the following upper bound:

    $f(Y) \le f(X) + \langle \nabla f(X), Y - X \rangle + \frac{M}{2} \|Y - X\|_F^2$.    (6)

Given the above definitions, we define $\kappa = M / m$ as the condition number of function $f$.

Finally, in high dimensional settings, the loss function $f$ often does not satisfy strong convexity globally, but
only on a restricted set of directions; see Section F for a more detailed discussion.

Definition 2.4. A convex function $f$ is $(m, r)$-restricted strongly convex if:

    $f(Y) \ge f(X) + \langle \nabla f(X), Y - X \rangle + \frac{m}{2} \|Y - X\|_F^2$, for any rank-$r$ matrices $X, Y \in \mathbb{S}^n_+$.    (7)

3 Factored gradient descent 

We solve the non-convex problem (2) via Factored Gradient Descent (Fgd) with update rule (footnote 3):

    $U^+ = U - \eta \nabla f(UU^\top) \cdot U$.

Fgd does this, but with two key innovations: a careful initialization and a special step size $\eta$. The discussion
on the initialization is deferred until Section 5.

Step size $\eta$. Even though $f$ is a convex function over $X \succeq 0$, the fact that we operate with the non-convex
$UU^\top$ parametrization means that we need to be careful about the step size $\eta$; e.g., our constant $\eta$ selection
should be such that, when we are close to $X^\star$, we do not "overshoot" the optimum $X^\star$.

In this work, we pick the step size parameter according to the following closed form (footnote 4):

    $\eta = \frac{1}{16 \left( M \|X^0\|_2 + \|\nabla f(X^0)\|_2 \right)}$.    (8)

Algorithm 1 Factored gradient descent (Fgd)
    input Function $f$, target rank $r$, iterations $K$.
    Compute $X^0$ as in (12).
    Set $U \in \mathbb{R}^{n \times r}$ such that $X^0 = UU^\top$.
    Set step size $\eta$ as in (8).
    for $k = 0$ to $K - 1$ do
        $U^+ = U - \eta \nabla f(UU^\top) \cdot U$.
        $U = U^+$.
    end for
    output $X = UU^\top$.

Recall that, if we were just doing standard gradient descent on $f$, we would choose a step size of $1/M$,
where $M$ is a uniform upper bound on the largest eigenvalue of the Hessian $\nabla^2 f(\cdot)$.

To motivate our step size selection, let us consider a simple setting where $U \in \mathbb{R}^{n \times r}$ with $r = 1$; i.e., $U$ is
a vector. For clarity, denote it as $u$. Let $f$ be a separable function with $f(X) = \sum_{i,j} f_{ij}(X_{ij})$. Furthermore,
define the function $g : \mathbb{R}^n \to \mathbb{R}$ such that $f(uu^\top) = g(u)$. It is easy to compute (see Lemma E.1):

    $\nabla g(u) = \nabla f(uu^\top) \cdot u \in \mathbb{R}^n$ and $\nabla^2 g(u) = \text{mat}\left( \text{diag}(\nabla^2 f(uu^\top)) \cdot \text{vec}(uu^\top) \right) + \nabla f(uu^\top) \in \mathbb{R}^{n \times n}$,

where $\text{mat} : \mathbb{R}^{n^2} \to \mathbb{R}^{n \times n}$, $\text{vec} : \mathbb{R}^{n \times n} \to \mathbb{R}^{n^2}$ and $\text{diag} : \mathbb{R}^{n^2 \times n^2} \to \mathbb{R}^{n^2 \times n^2}$ are the matricization,
vectorization and diagonalization operations, respectively; for the last case, diag generates a diagonal matrix
from the input, discarding its off-diagonal elements. We remind that $\nabla f(uu^\top) \in \mathbb{R}^{n \times n}$ and
$\nabla^2 f(uu^\top) \in \mathbb{R}^{n^2 \times n^2}$. Note also that $\nabla^2 f(X)$ is diagonal for separable $f$.

3 The true gradient of $f$ with respect to $U$ is $2 \nabla f(UU^\top) \cdot U$. However, for simplicity and clarity of exposition, in our algorithm
and its theoretical guarantees, we absorb the 2-factor in the step size $\eta$.

4 Constant 16 in the expression (8) appears due to our analysis, where we do not optimize over the constants. One can use
another constant in order to be more aggressive; nevertheless, we observed that our setting works well in practice.

Standard convex optimization suggests that $\eta$ should be chosen such that $\eta < 1 / \|\nabla^2 g(\cdot)\|_2$. The above suggests
the following step size selection rule for $M$-smooth $f$:

    $\eta < \frac{1}{\|\nabla^2 g(\cdot)\|_2} \propto \frac{1}{M \|X\|_2 + \|\nabla f(X)\|_2}$.

In stark contrast with classic convex optimization where $\eta \propto 1/M$, the step size selection further depends on the
spectral information of the current iterate and the gradient. Since computing $\|X\|_2$, $\|\nabla f(X)\|_2$ per iteration
could be computationally inefficient, we use the spectral norm of $X^0$ and its gradient $\nabla f(X^0)$ as surrogates, where
$X^0$ is the initialization point (footnote 5).
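As a concrete illustration of the closed-form rule $\eta = 1 / (16 (M \|X^0\|_2 + \|\nabla f(X^0)\|_2))$, the sketch below evaluates it for a small example. To stay dependency-free it assumes symmetric $2 \times 2$ matrices, for which the spectral norm has a closed form; the function names, the objective $f(X) = \|X - Y\|_F^2$ (so that $M = 2$), and the values of `Y` and `u0` are assumptions made for this example only.

```python
import math


def spectral_norm_sym2(A):
    """Spectral norm of a symmetric 2 x 2 matrix via its closed-form eigenvalues."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return max(abs(mean + radius), abs(mean - radius))


def fgd_step_size(M, X0, grad_X0):
    """eta = 1 / (16 (M ||X0||_2 + ||grad f(X0)||_2))."""
    return 1.0 / (16.0 * (M * spectral_norm_sym2(X0) + spectral_norm_sym2(grad_X0)))


# Hypothetical instance: f(X) = ||X - Y||_F^2 (so M = 2) with X0 = u0 u0^T.
Y = [[4.0, 2.0], [2.0, 1.0]]
u0 = [1.5, 0.5]
X0 = [[a * b for b in u0] for a in u0]
grad_X0 = [[2.0 * (X0[i][j] - Y[i][j]) for j in range(2)] for i in range(2)]

eta = fgd_step_size(2.0, X0, grad_X0)
print(eta)   # far smaller than the classic choice 1 / M = 0.5
```

Both norms are computed once, at the initial point, which is what makes the rule inexpensive compared with recomputing spectral norms at every iteration.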

To clarify the $\eta$ selection further, we next describe a toy example, in order to illustrate the necessity of such a
scaling of the step size. Consider the following minimization problem:

    $\text{minimize}_{u \in \mathbb{R}^{n \times 1}} \; f(uu^\top) := \|uu^\top - Y\|_F^2$,

where $u \equiv U \in \mathbb{R}^{n \times 1}$ (and thus $X = uu^\top$, i.e., we are interested in rank-1 solutions) and $Y$ is a given rank-2
matrix such that $Y = \alpha^2 v_1 v_1^\top - \beta^2 v_2 v_2^\top$, for $\alpha > \beta \in \mathbb{R}$ and $v_1, v_2$ orthonormal vectors. Observe that $f$ is
a strongly convex function with rank-1 minimizer $X^\star = \alpha^2 v_1 v_1^\top$; let $u^\star = \alpha v_1$. It is easy to verify that (i)
$\|X^\star\|_2 = \alpha^2$, (ii) $\|\nabla f(X^\star)\|_2 = \|2 \cdot (X^\star - Y)\|_2 = 2\beta^2$, and (iii) $\|\nabla f(X_1) - \nabla f(X_2)\|_F \le M \cdot \|X_1 - X_2\|_F$,
where $M = 2$.

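Consider the case where $u = a v_1 + b v_2$ is the current estimate, for scalars $a, b$. Then, using
$\nabla f(X) = 2 (X - Y)$, the gradient of $f$ at $u$ is evaluated as:

    $\nabla f(uu^\top) \cdot u = 2 (uu^\top - Y) \cdot (a v_1 + b v_2) = 2a (a^2 + b^2 - \alpha^2) v_1 + 2b (a^2 + b^2 + \beta^2) v_2$.

Hence, according to the update rule $u^+ = u - 2\eta \nabla f(uu^\top) \cdot u$, the next iterate satisfies:

    $u^+ = a \left( 1 - 4\eta (a^2 + b^2 - \alpha^2) \right) v_1 + b \left( 1 - 4\eta (a^2 + b^2 + \beta^2) \right) v_2$.

Observe that, when $a$ and $b$ are on the order of $\alpha$ and $\beta$, the coefficients of $v_1$ and $v_2$ in $u^+$ include
$O(\alpha^3)$ and $O(\beta^3)$ quantities.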

The quality of $u^+$ clearly depends on how $\eta$ is chosen. In the case $\eta = 1/M = 1/2$, such a step size can result in
divergence/"overshooting", as $\|X^\star\|_2 = O(\alpha^2)$ and $\|\nabla f(X^\star)\|_2 = O(\beta^2)$ can be arbitrarily large (independent
of $M$). Therefore, it could be the case that $\text{Dist}(u^+, u^\star) > \text{Dist}(u, u^\star)$.

In contrast, consider the step size (footnote 6) $\eta = \frac{1}{16 (M \|X^\star\|_2 + \|\nabla f(X^\star)\|_2)} \propto \frac{1}{C (\alpha^2 + \beta^2)}$. Then, with appropriate scaling
$C$, we observe that $\eta$ lessens the effect of the $O(\alpha^3)$ and $O(\beta^3)$ terms in the $v_1$ and $v_2$ components, which lead to
overshooting for the case $\eta = 1/M$. This most likely results in $\text{Dist}(u^+, u^\star) \le \text{Dist}(u, u^\star)$.
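The contrast between the two step sizes can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' experiment: it takes $n = 2$ with $v_1 = e_1$, $v_2 = e_2$, picks hypothetical values $\alpha = 3$, $\beta = 2$ and a hypothetical current estimate $u = (2.7, 0.2)$, and applies one step of $u^+ = u - \eta \nabla f(uu^\top) u$ (the convention of Algorithm 1, with the factor 2 absorbed in $\eta$).

```python
import math


def fgd_step(u, eta, alpha, beta):
    """One step u+ = u - eta * grad f(u u^T) u for Y = diag(alpha^2, -beta^2)."""
    a, b = u
    sq = a * a + b * b
    # grad f(u u^T) u = 2 (u u^T - Y) u, written out in the basis v1 = e1, v2 = e2.
    g = (2.0 * a * (sq - alpha ** 2), 2.0 * b * (sq + beta ** 2))
    return (a - eta * g[0], b - eta * g[1])


def dist(u, v):
    """Rank-1 Dist: minimum over the sign ambiguity of the factor."""
    return min(math.hypot(u[0] - v[0], u[1] - v[1]),
               math.hypot(u[0] + v[0], u[1] + v[1]))


alpha, beta, M = 3.0, 2.0, 2.0       # hypothetical values with alpha > beta
u_star = (alpha, 0.0)                # u* = alpha * v1
u = (2.7, 0.2)                       # hypothetical current estimate near u*

eta_naive = 1.0 / M
eta_scaled = 1.0 / (16.0 * (M * alpha ** 2 + 2.0 * beta ** 2))

d0 = dist(u, u_star)
d_naive = dist(fgd_step(u, eta_naive, alpha, beta), u_star)
d_scaled = dist(fgd_step(u, eta_scaled, alpha, beta), u_star)
print(d0, d_naive, d_scaled)
```

For these particular values, the step with $\eta = 1/M$ moves the iterate farther from $u^\star$ than where it started, while the scaled step size reduces the distance.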

5 However, as we show in Section I, one could compute $\|X\|_2$ and $\|\nabla f(X)\|_2$ per iteration in order to relax some of the
requirements of our approach.

6 For illustration purposes, we consider a step size that depends on the unknown $X^\star$; in practice, our step size selection is a
surrogate of this choice and our results automatically carry over, with appropriate scaling.
Computational complexity. The per iteration complexity of Fgd is dominated by the gradient computation.
This computation is required in any first-order algorithm and the complexity of this operation depends on
the function $f$. Apart from $\nabla f(X)$, the additional computation required in Fgd is matrix-matrix additions and
multiplications, with time complexity upper bounded by $\text{nnz}(\nabla f(\cdot)) \cdot r$, where $\text{nnz}(\nabla f(\cdot))$ denotes the number
of non-zeros in the gradient at the current point (footnote 7). Hence, the per iteration complexity of Fgd is much lower
than traditional convex methods like projected gradient descent or classic interior point methods [62], [63],
as they often require a full eigenvalue decomposition per step.

Note that, for $r = O(n)$, Fgd and projected gradient descent have the same per iteration complexity of $O(n^3)$.
However, Fgd performs only a single matrix-matrix multiplication operation, which is much "cheaper" than
an SVD calculation. Moreover, matrix multiplication is an easier-to-parallelize operation, as opposed to the
eigen-decomposition operation, which is inherently sequential. We notice this behavior in practice; see the sections
on applications in matrix sensing and quantum state tomography.


4 Local convergence of Fgd 

In this section, we present our main theoretical results on the performance of Fgd. We present convergence
rates for the settings where (i) $f$ is an $M$-smooth convex function, and (ii) $f$ is an $M$-smooth and $(m, r)$-restricted
strongly convex function. These assumptions are now standard in convex optimization. Note that, since the
$UU^\top$ factorization makes the problem non-convex, it is hard to guarantee convergence of gradient descent
schemes in general, without any additional assumptions.

We now state the main assumptions required by Fgd for convergence: 

Fgd Assumptions 

• Initialization: We assume that Fgd is initialized with a "good" starting point $X^0 = U^0 (U^0)^\top$ that has
constant relative error to $X^\star_r = U^\star_r (U^\star_r)^\top$ (footnote 8). In particular, we assume

    (A1) $\text{Dist}(U^0, U^\star_r) \le \rho \, \sigma_r(U^\star_r)$ for $\rho := \frac{1}{100} \frac{\sigma_r(X^\star)}{\sigma_1(X^\star)}$    (Smooth $f$)

    (A2) $\text{Dist}(U^0, U^\star_r) \le \rho' \sigma_r(U^\star_r)$ for $\rho' := \frac{1}{100 \kappa} \frac{\sigma_r(X^\star)}{\sigma_1(X^\star)}$    (Strongly convex $f$),

for the smooth and restricted strongly convex setting, respectively. This assumption helps in avoiding
saddle points, introduced by the $U$ parametrization (footnote 9).

In many applications, an initial point $U^0$ with this type of guarantee is easy to obtain, often with just
one eigenvalue decomposition; we refer the reader to prior work for specific initialization
procedures for different problem settings. See also Section 5 for a more detailed discussion. Note that the
problem is still non-trivial after the initialization, as this only gives a constant error approximation.

• Approximate rank-r optimum: In many learning applications, such as localization [43] and multilabel
learning, the true $X^\star$ emerges as the superposition of a low rank latent matrix plus a small perturbation
term, such that $\|X^\star - X^\star_r\|_F$ is small. While, in practice, it might be the case that $\text{rank}(X^\star) = n$, due to
the presence of noise, often we are more interested in revealing the latent low-rank part. As already
mentioned, we might as well set $r < \text{rank}(X^\star)$ for computational or statistical reasons. In all these cases,
further assumptions w.r.t. the quality of approximation have to be made. In particular, let $X^\star$ be the
optimum of (1) and $f$ be $M$-smooth and $(m, r)$-restricted strongly convex. In our analysis, we assume:

    (A3) $\|X^\star - X^\star_r\|_F \le \frac{1}{200 \kappa^{1.5}} \frac{\sigma_r(X^\star)}{\sigma_1(X^\star)} \sigma_r(X^\star)$    (Strongly convex $f$).

This assumption intuitively requires the noise magnitude to be smaller than the optimum and constrains
the rank-constrained optimum to be closer to $X^\star_r$ (footnote 10).

7 It could also occur that the gradient $\nabla f(X)$ is low-rank, or low-rank + sparse, depending on the problem at hand; it could also
happen that the structure of $\nabla f(X)$ leads to "cheap" matrix-vector calculations, when applied to vectors. Here, we state a more
generic, and maybe pessimistic, scenario where $\nabla f(X)$ is unstructured.

8 If $r = r^\star$, then one can drop the subscript. For completeness and in order to accommodate the approximate rank-$r$ case,
described below, we will keep the subscript in our discussion.

9 To illustrate this, consider the following example:

    $\text{minimize}_{U \in \mathbb{R}^{n \times r}} \; f(UU^\top) := \|UU^\top - U^\star_r (U^\star_r)^\top\|_F^2$.

Now it is easy to see that $\text{Dist}(U^\star_{r-1}, U^\star_r) = \sigma_r(U^\star_r)$ and $U^\star_{r-1}$ is a stationary point of the function considered
($\nabla f(U^\star_{r-1} (U^\star_{r-1})^\top) \cdot U^\star_{r-1} = 0$). We need the initial error to be further smaller than $\sigma_r(U^\star_r)$ by a factor of the condition
number of $X^\star_r$.

10 Note that the assumption (A3) can be dropped by using a different step size $\eta$ (see Theorem I.4 in Section I). However, this
requires two additional spectral norm computations per iteration.
We note that, in the results presented below, we have not attempted to optimize over the constants appearing
in the assumptions and any intermediate steps of our analysis. Finding such tight constants could strengthen
our arguments for fast convergence; however, it does not change our claims for sublinear or linear convergence
rates. Moreover, we consider the case $r \le \text{rank}(X^\star)$; we believe the analysis can be extended to the setting
$r > \text{rank}(X^\star)$ and leave it for future work (footnote 11).


4.1 $1/k$ convergence rate for smooth $f$

Next, we state our first main result under the smoothness condition, as in Definition 2.3. In particular, we prove
that Fgd makes progress per iteration with sublinear rate. Here, we assume only the case where $r = r^\star$; for
consistency reasons, we denote $X^\star = X^\star_r$. Key lemmas and their proofs for this case are provided in Section C.

Theorem 4.1 (Convergence performance for smooth $f$). Let $X^\star_r = U^\star_r U^{\star\top}_r$ denote an optimum of $M$-smooth $f$
over the PSD cone. Let $f(X^0) > f(X^\star_r)$. Then, under assumption (A1), after $k$ iterations, the Fgd algorithm
finds a solution $X^k$ such that

    $f(X^k) - f(X^\star_r) \le \frac{\frac{5}{\eta} \cdot \text{Dist}(U^0, U^\star_r)^2}{k + \frac{5}{\eta} \cdot \frac{\text{Dist}(U^0, U^\star_r)^2}{f(X^0) - f(X^\star_r)}}$.    (9)

The theorem states that provided (i) we choose the step size $\eta$ based on a starting point that has constant
relative distance to $U^\star_r$, and (ii) we start from such a point, gradient descent on $U$ will converge sublinearly to
a point $X^\star_r$. In other words, Theorem 4.1 shows that Fgd computes a sequence of estimates in the $U$-factor
space such that the function values decrease with $O(1/k)$ rate, towards a global minimum of the function $f$. Recall
that, even in the standard convex setting, classic gradient descent schemes over $X$ achieve the same $O(1/k)$
convergence rate for smooth convex functions. Hence, Fgd matches the rate of convex gradient descent,
under the assumptions of Theorem 4.1. The above are abstractly illustrated in Figure 1.


Figure 1: Abstract illustration of Theorem 4.1 and the behavior of Fgd in the case where $f$ is just $M$-smooth.
The grey-shaded area represents the set of optimum solutions $X^\star = X^\star_r$. Let the orange triangle denote the
optimum, close to which Fgd converges; the dashed red circle denotes the optimization tolerance/error.


4.2 Linear convergence rate under strong convexity assumption 

Here, we show that, with the additional assumption that $f$ satisfies the $(m, r)$-restricted strong convexity over
$\mathbb{S}^n_+$, Fgd achieves a linear convergence rate. The proof is provided in Section B.

11 Experimental results on synthetic matrix sensing settings have shown that, if we overshoot $r$, i.e., $r > \text{rank}(X^\star)$, Fgd still
performs well, finding an $\varepsilon$-accurate solution with linear rate.
Theorem 4.2 (Convergence rate for restricted strongly convex $f$). Let the current iterate be $U$ and $X = UU^\top$.
Assume $\text{Dist}(U, U^\star_r) \le \rho' \sigma_r(U^\star_r)$ and let the step size be $\eta = \frac{1}{16 (M \|X^0\|_2 + \|\nabla f(X^0)\|_2)}$. Then, under assumptions
(A2), (A3), the new estimate $U^+ = U - \eta \nabla f(X) \cdot U$ satisfies

    $\text{Dist}(U^+, U^\star_r)^2 \le \alpha \cdot \text{Dist}(U, U^\star_r)^2 + \beta \cdot \|X^\star - X^\star_r\|_F^2$,    (10)

where $\alpha = 1 - \frac{m \sigma_r(X^\star_r)}{64 (M \|X^\star_r\|_2 + \|\nabla f(X^\star_r)\|_2)}$ and $\beta = \frac{M}{28 (M \|X^\star_r\|_2 + \|\nabla f(X^\star_r)\|_2)}$. Furthermore, $U^+$ satisfies
$\text{Dist}(U^+, U^\star_r) \le \rho' \sigma_r(U^\star_r)$.

The theorem states that provided (i) we choose the step size based on a point that has constant relative
distance to $U^\star_r$, and (ii) we start from such a point, gradient descent on $U$ will converge linearly to a neighborhood
of $U^\star_r$. The above theorem immediately implies a linear convergence rate for the setting where $f$ is standard
strongly convex, with parameter $m$. This follows by observing that standard strong convexity implies restricted
strong convexity for all values of rank $r$.

Last, we present results for the special case where $r = r^\star$; in this case, Fgd finds an optimal point $U^\star_r$ with
linear rate, within the equivalent class of orthonormal matrices in $\mathcal{O}$.


Corollary 4.3 (Exact recovery of $X^\star$). Let $X^\star$ be the optimal point of $f$, over the set of PSD matrices, such
that $\text{rank}(X^\star) = r$. Consider $X$ as in Theorem 4.2. Then, under the same assumptions and with the same
convergence factor $\alpha$ as in Theorem 4.2, we have

    $\text{Dist}(U^+, U^\star)^2 \le \alpha \cdot \text{Dist}(U, U^\star)^2$.


Further, for $r = n$ we recover the exact case of semi-definite optimization. In plain words, the above
corollary suggests that, given an accuracy parameter $\varepsilon$, Fgd requires $K = O(\log(1/\varepsilon))$ iterations in order to
achieve $\text{Dist}(U^K, U^\star)^2 \le \varepsilon$; recall the analogous result for classic gradient schemes for $M$-smooth and strongly
convex functions $f$, where similar rates can be achieved in $X$ space. The above are abstractly illustrated
in Figure 2.
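The $K = O(\log(1/\varepsilon))$ count follows from unrolling the contraction of Corollary 4.3: after $K$ steps the squared distance is at most $\alpha^K$ times its initial value, so it suffices that $K \ge \log(\text{Dist}_0^2 / \varepsilon) / \log(1/\alpha)$. The small helper below makes this arithmetic explicit; the function name and the numeric values ($\alpha = 0.9$, unit initial squared distance) are hypothetical and only serve as an example.

```python
import math


def iterations_for_accuracy(alpha, dist0_sq, eps):
    """Smallest K with alpha**K * dist0_sq <= eps, assuming 0 < alpha < 1."""
    if dist0_sq <= eps:
        return 0
    return math.ceil(math.log(dist0_sq / eps) / math.log(1.0 / alpha))


K = iterations_for_accuracy(alpha=0.9, dist0_sq=1.0, eps=1e-6)
print(K)   # 132
```

Since $\alpha$ approaches 1 as the condition numbers of $f$ and $X^\star_r$ grow, the constant hidden in the $O(\cdot)$ grows accordingly.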


Figure 2: Abstract illustration of Theorem 4.2 and Corollary 4.3. The two curves denote the two cases: (i)
$r = \text{rank}(X^\star)$ and (ii) $r < \text{rank}(X^\star)$. (i) In the first case, the triangle marker denotes the unique optimum $X^\star$
and the dashed red circle denotes the optimization tolerance/error. (ii) In the case where $r < \text{rank}(X^\star)$, let the
cyan circle with radius $c \|X^\star - X^\star_r\|_F$ (set $c = 1$ for simplicity) denote a neighborhood around $X^\star$. In this case,
Fgd converges to a rank-$r$ approximation in the vicinity of $X^\star$ at a linear rate, according to Theorem 4.2.
Legend: sequence of estimates when $r < \text{rank}(X^\star)$; sequence of estimates when $r = \text{rank}(X^\star)$; level sets of the
objective $f$; unique optimal $X^\star$; rank-$r$ approximation $X^\star_r$; optimization tolerance; rank-$r$ starting point $X^0$.


Remark 1. By the results above, one can easily observe that the convergence rate factor $\alpha$, in contrast to
standard convex gradient descent results, depends both on the condition number of $X^\star_r$ and $\|\nabla f(X^\star)\|_2$, in
addition to $\kappa$. This dependence is a result of the step size selection, which is different from standard step sizes,
i.e., $1/M$ for standard gradient descent schemes.

As a ramification of the above, notice that $\alpha$ depends only on the condition number of $X^\star_r$ and not that of
$X^\star$. This suggests that, in settings where the optimum $X^\star$ has a bad condition number (and thus leads to slower
convergence), it is indeed beneficial to restrict $U$ to be an $n \times r$ matrix and only search for a rank-$r$ approximation
of the optimal solution, which leads to a faster convergence rate in practice; see Figure 8 in our experimental
findings at the end of Section F.3.

Remark 2. In the setting where the optimum $X^\star$ is 0, directly applying the above theorems requires an
initialization that is exactly at the optimum 0. On the contrary, this is actually an easy setting and Fgd converges
from any initial point to the optimum.


5 Initialization 


In the previous section, we showed that gradient descent over $U$ achieves sublinear/linear convergence, once the
iterates are close to $U^\star_r$. Since the overall problem is non-convex, intuition suggests that we need to start from
a "decent" initial point, in order to get provable convergence to $U^\star_r$.

One way to satisfy this condition for general convex $f$ is to use one of the standard convex algorithms and
obtain $U$ within constant error to $U^\star$ (or $U^\star_r$); then, switch to Fgd to get the high precision solution. A specific
implementation of this idea for matrix sensing appears in prior work. Such an initialization procedure comes with the
following guarantees; the proof can be found in Section D.

Lemma 5.1. Let $f$ be an $M$-smooth and $(m, r)$-restricted strongly convex function over PSD matrices and let
$X^\star$ be the minimum of $f$ with $\text{rank}(X^\star) = r$. Let $X^+ = \mathcal{P}_+(X - \frac{1}{M} \nabla f(X))$ be the projected gradient descent
update. Then, $\|X^+ - X\|_F \le \frac{c}{\kappa \tau(X_r)} \sigma_r(X)$ implies

    $\text{Dist}(U_r, U^\star_r) \le \frac{c'}{\tau(X^\star_r)} \sigma_r(U^\star_r)$, for constants $c, c' > 0$.

Next, we present a generic initialization scheme for general smooth and strongly convex $f$. We use only the
first-order oracle: we only have access to, at most, gradient information of $f$. Our initialization comes with
theoretical guarantees w.r.t. distance from the optimum. Nevertheless, in order to show small relative distance in
the form of $\text{Dist}(U^0, U^\star_r) \le \rho \, \sigma_r(U^\star_r)$, one requires certain condition numbers of $f$ and further assumptions on
the spectrum of the optimal solution $X^\star$ and rank $r$. However, empirical findings in Section F.3 show that our
initialization performs well in practice.

Let $\nabla f(0) \in \mathbb{R}^{n \times n}$. Since the initial point should be in the PSD cone, we further consider the projection
$\mathcal{P}_+(-\nabla f(0))$. By strong convexity and smoothness of $f$, one can observe that the point $\frac{1}{M} \cdot \mathcal{P}_+(-\nabla f(0))$ is a
good initialization point, within some radius from the vicinity of $X^\star$; i.e.,

    $\left\| \frac{1}{M} \mathcal{P}_+(-\nabla f(0)) - X^\star \right\|_F \le 2 \left( 1 - \frac{1}{\kappa} \right) \|X^\star\|_F$;    (11)

see also Theorem 5.2. Thus, a scaling of $\mathcal{P}_+(-\nabla f(0))$ by $M$ could serve as a decent initialization. In many
recent works [32], [64], [19], [52], [25] this initialization has been used for specific applications (footnote 12). Here, we note that
the point $\frac{1}{M} \cdot \mathcal{P}_+(-\nabla f(0))$ can be used as an initialization point for generic smooth and strongly convex $f$.

The smoothness parameter $M$ is not always easy to compute exactly; in such cases, one can use the surrogate
$m \le \|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F \le M$. Finally, our initial point $U^0 \in \mathbb{R}^{n \times r}$ is a rank-$r$ matrix such that $X^0_r = U^0 U^{0\top}$.
We now present guarantees for the initialization discussed. The proof is provided in Section D.2.

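Theorem 5.2 (Initialization). Let $f$ be an $M$-smooth and $m$-strongly convex function, with condition number
$\kappa = M / m$, and let $X^\star$ be its minimum over PSD matrices. Let $X^0$ be defined as:

    $X^0 := \frac{1}{\|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F} \mathcal{P}_+(-\nabla f(0))$,    (12)

and $X^0_r$ is its rank-$r$ approximation. Let $\|X^\star - X^\star_r\|_F \le \tilde{\rho} \|X^\star_r\|_2$ for some $\tilde{\rho}$. Then, $\text{Dist}(U^0, U^\star_r) \le \gamma \sigma_r(U^\star_r)$,
where $\gamma = 4 \tau(X^\star_r) \sqrt{2r} \cdot \left( \sqrt{\kappa^2 - 2/\kappa + 1} \, (\text{srank}^{1/2}(X^\star_r) + \tilde{\rho}) + \tilde{\rho} \right)$ and $\text{srank}(X^\star_r) = \|X^\star_r\|_F^2 / \|X^\star_r\|_2^2$.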

While the above result guarantees a good initialization only for small values of $\kappa$, in many applications
@2 m Esa, this is indeed the case and X° has constant relative error to the optimum. 

To understand this result, notice that in the extreme case, when $f$ is the $\ell_2$ loss function $\|X - X^\star\|_F^2$, which
has condition number $\kappa = 1$ and $\mathrm{rank}(X^\star) = r$, $X^0$ indeed is the optimum. More generally, as the condition
number $\kappa$ increases, the optimum moves away from $X^0$, and the above theorem characterizes this error as a
function of the condition number of the function. See also Figure 3.
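The initialization in (12) can be sketched in a few lines. The following is a minimal illustration, not the authors' code: the helper names `psd_projection` and `initialize`, the random seed, and the test objective $f(X) = \|X - X^\star\|_F^2$ (the $\kappa = 1$ case discussed above) are assumptions made for the example.

```python
import numpy as np

def psd_projection(S):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    S = 0.5 * (S + S.T)                      # guard against round-off asymmetry
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

def initialize(grad_f, n, r):
    """Return U0 (n x r) such that U0 U0^T is the rank-r approximation of X0 in (12)."""
    zero = np.zeros((n, n))
    e1e1 = np.zeros((n, n))
    e1e1[0, 0] = 1.0
    # Surrogate for the smoothness constant M, as suggested in the text.
    M_hat = np.linalg.norm(grad_f(zero) - grad_f(e1e1), "fro")
    X0 = psd_projection(-grad_f(zero)) / M_hat
    w, V = np.linalg.eigh(X0)                # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:r]            # indices of the r largest eigenvalues
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))

# Toy check with f(X) = ||X - X_star||_F^2, whose gradient is 2 (X - X_star).
rng = np.random.default_rng(0)
n, r = 5, 2
U_star = rng.standard_normal((n, r))
X_star = U_star @ U_star.T
grad_f = lambda X: 2.0 * (X - X_star)

U0 = initialize(grad_f, n, r)
init_error = np.linalg.norm(U0 @ U0.T - X_star, "fro")
```

For this objective the surrogate equals 2 and $\mathcal{P}_+(-\nabla f(0)) = 2X^\star$, so `init_error` is zero up to round-off; for objectives with $\kappa > 1$ the error grows as the theorem describes.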

Now for the setting when the optimum is exactly rank-r we get the following result. 

12 To see this, consider the case of the least-squares objective $f(X) := \tfrac{1}{2}\|\mathcal{A}(X) - y\|_2^2$, where $y$ denotes the set of observations and
$\mathcal{A}$ is a properly designed sensing mechanism, depending on the problem at hand. For example, in the affine rank minimization
case 18211231, $(\mathcal{A}(X))_i$ represents the linear system mechanism where $\mathrm{Tr}(A_i \cdot X) = b_i$. Under this setting, computing the gradient
$\nabla f(\cdot)$ at the zero point, we have $-\nabla f(0) = \mathcal{A}^*(y)$, where $\mathcal{A}^*$ is the adjoint operator of $\mathcal{A}$. Then, the operation
$\mathcal{P}_+(-\nabla f(0))$ is very similar to the spectral methods proposed for initialization in the references above.
Figure 3: Abstract illustration of the initialization effect on a toy example. In this experiment, we design $X^\star = U^\star U^{\star\top}$, where $U^\star = [1\ 1]^\top$ (or $U^\star = -[1\ 1]^\top$; these are equivalent). We observe $X^\star$ via $y = \mathrm{vec}(A \cdot X^\star)$,
where $A \in \mathbb{R}^{3 \times 2}$ is randomly generated. We consider the loss function $f(UU^\top) = \tfrac{1}{2}\|y - \mathrm{vec}(A \cdot UU^\top)\|_2^2$. Left
panel: $f$ values in logarithmic scale for various values of the variable $U \in \mathbb{R}^{2 \times 1}$. Center panel: contour lines of $f$
and the behavior of Fgd using our initialization scheme. Right panel: zoom-in plot of the center plot.
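The toy setting of the figure can be reproduced approximately with the sketch below. It is an illustration under stated assumptions: the sensing matrix `A` is fixed by hand rather than randomly generated, the gradient is symmetrized explicitly, the smoothness constant is taken as the spectral norm of $A^\top A$, and the step size is computed once at the initial point as $1/(16(M\|X^0\|_2 + \|\nabla f(X^0)\|_2))$.

```python
import numpy as np

# Fixed sensing matrix (an assumption; the figure uses a random A in R^{3x2}).
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
U_star = np.array([[1.0], [1.0]])
X_star = U_star @ U_star.T
Y = A @ X_star                               # observations, kept in matrix form

def loss(X):
    return 0.5 * np.linalg.norm(A @ X - Y, "fro") ** 2

def grad_f(X):
    G = A.T @ (A @ X - Y)
    return 0.5 * (G + G.T)                   # symmetrized gradient with respect to X

# Initialization: scaled -grad f(0); for r = 1 the top eigenpair of the PSD
# projection coincides with the top eigenpair of the matrix itself.
zero = np.zeros((2, 2))
e1e1 = np.array([[1.0, 0.0], [0.0, 0.0]])
M_hat = np.linalg.norm(grad_f(zero) - grad_f(e1e1), "fro")
w, V = np.linalg.eigh(-grad_f(zero) / M_hat)
U = V[:, [-1]] * np.sqrt(max(w[-1], 0.0))

# Step size computed once from the initial point.
M = np.linalg.norm(A.T @ A, 2)               # smoothness constant of this quadratic f
X0 = U @ U.T
eta = 1.0 / (16.0 * (M * np.linalg.norm(X0, 2) + np.linalg.norm(grad_f(X0), 2)))

losses = [loss(X0)]
for _ in range(3000):
    U = U - eta * grad_f(U @ U.T) @ U        # Fgd update U+ = U - eta * grad f(UU^T) U
    losses.append(loss(U @ U.T))

final_error = np.linalg.norm(U @ U.T - X_star, "fro")
```

The error is measured on $UU^\top$ rather than on $U$, since $U^\star$ and $-U^\star$ are equivalent factors and the eigenvector sign returned by `eigh` is arbitrary.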


Corollary 5.3 (Initialization, exact). Let $X^\star$ be rank-$r$ for some $r < n$. Then, under the conditions of
Theorem 5.2, we get

$$\mathrm{Dist}(U^0, U^\star) \le 4\sqrt{2}\,r\,\tau(X^\star)\cdot\sqrt{\kappa^2 - 2/\kappa + 1}\cdot\sigma_r(U^\star).$$

Finally, for the setting when the function satisfies $(m, r)$-restricted strong convexity, the above corollary still
holds as the optimum is a rank-$r$ matrix.


6 Convergence proofs for the Fgd algorithm 


In this section, we first present the key techniques required for analyzing the convergence of Fgd. Later, we
present proofs for both Theorems 4.1 and 4.2. Throughout the proofs we use the following notation: $X^\star$
is the optimum of problem (1) and $X_r^\star = U_r^\star R_U^\star (U_r^\star R_U^\star)^\top$ is its rank-$r$ approximation; for the just smooth
case, $X^\star = X_r^\star$, as we consider only the rank-$r^\star$ case and $r = r^\star$. Let $R_U^\star := \arg\min_{R \in \mathcal{O}} \|U - U_r^\star R\|_F$ and
$\Delta := U - U_r^\star R_U^\star$.
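The minimization over rotations that defines $R_U^\star$ is an orthogonal Procrustes problem with a closed-form solution through an SVD. The sketch below illustrates this; the function name `dist`, the seed, and the test matrices are assumptions made for the example.

```python
import numpy as np

def dist(U, V):
    """Return (Dist(U, V), R), where R minimizes ||U - V R||_F over orthogonal R."""
    # Orthogonal Procrustes: if V^T U = P S Q^T, the minimizer is R = P Q^T.
    P, _, Qt = np.linalg.svd(V.T @ U)
    R = P @ Qt
    return np.linalg.norm(U - V @ R, "fro"), R

rng = np.random.default_rng(1)
V = rng.standard_normal((6, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
U = V @ rot                                  # same X = U U^T, different factor
d, R = dist(U, V)
```

Here `U` and `V` generate the same matrix $UU^\top$, so the distance `d` is zero up to round-off and the recovered `R` matches the rotation `rot`.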

A key property that assists classic gradient descent to converge to the optimum $X^\star$ is the fact that $\langle X^+ - X,\, X^\star - X\rangle \ge 0$ for a smooth convex function $f$; in the case of strongly convex $f$, the inner product is further
lower bounded by $\tfrac{m}{2}\|X - X^\star\|_F^2$ (see Theorem 2.2.7 of [HOD-). Classical proofs mainly use such lower bounds to
show convergence (see Theorems 2.1.13 and 2.2.8 of EDI).

We follow broadly similar steps in order to show convergence of Fgd. In particular,

• In Section 6.1, we show a lower bound for the inner product $\langle U - U^+,\, U - U_r^\star R_U^\star\rangle$ (Lemma 6.1), even
though the function is not convex in $U$. The initialization and rank-$r$ approximate optimum assumptions
play a crucial role in proving this, along with the fact that $f$ is convex in $X$.

• In Sections 6.2 and 6.3, we use the above lower bound to show convergence for (i) smooth and strongly convex $f$,
and (ii) just smooth $f$, respectively, similar to the convex setting.


6.1 Rudiments of our analysis 

Next, we present the main descent lemma that is used for both sublinear and linear convergence rate guarantees 
of Fgd. 

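Lemma 6.1 (Descent lemma). For $f$ being an $M$-smooth and $(m, r)$-restricted strongly convex function and, under assumptions (A2) and (A3), the following inequality holds true:

$$\tfrac{1}{\eta}\langle U - U^+,\, U - U_r^\star R_U^\star\rangle \ge \tfrac{2}{3}\eta\,\|\nabla f(X)U\|_F^2 + \tfrac{3m}{20}\cdot\sigma_r(X^\star)\,\mathrm{Dist}(U, U_r^\star)^2 - \tfrac{M}{4}\|X^\star - X_r^\star\|_F^2.$$

Further, when $f$ is just an $M$-smooth convex function and, under the assumptions $f(X^+) \ge f(X^\star)$ and (A1), we
have:

$$\tfrac{1}{\eta}\langle U - U^+,\, U - U_r^\star R_U^\star\rangle \ge \tfrac{1}{2}\eta\,\|\nabla f(X)U\|_F^2.$$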
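Proof. First, we rewrite the inner product as shown below.

$$\begin{aligned}
\langle \nabla f(X)U,\, U - U_r^\star R_U^\star\rangle &= \langle \nabla f(X),\, X - U_r^\star R_U^\star U^\top\rangle \\
&= \tfrac{1}{2}\langle \nabla f(X),\, X - X_r^\star\rangle + \langle \nabla f(X),\, \tfrac{1}{2}(X + X_r^\star) - U_r^\star R_U^\star U^\top\rangle \\
&= \tfrac{1}{2}\langle \nabla f(X),\, X - X_r^\star\rangle + \tfrac{1}{2}\langle \nabla f(X),\, \Delta\Delta^\top\rangle, \qquad (13)
\end{aligned}$$

which follows by adding and subtracting $\tfrac{1}{2}X_r^\star$.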
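The decomposition in (13) is a purely algebraic identity that needs only symmetry of the gradient, so it can be checked numerically. In the sketch below, `G` is a random symmetric stand-in for $\nabla f(X)$ and `V` a random stand-in for $U_r^\star R_U^\star$; both, together with the seed, are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 6, 2
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))              # stands in for U_r^* R_U^*
G = rng.standard_normal((n, n))
G = 0.5 * (G + G.T)                          # symmetric stand-in for grad f(X)

X, X_r = U @ U.T, V @ V.T
Delta = U - V

def inner(A, B):
    """Trace inner product <A, B> = Tr(A^T B)."""
    return float(np.sum(A * B))

lhs = inner(G @ U, Delta)
rhs = 0.5 * inner(G, X - X_r) + 0.5 * inner(G, Delta @ Delta.T)
```

The two sides agree up to round-off for any choice of `U`, `V` and symmetric `G`; the analysis then bounds each of the two terms on the right separately.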

• Strongly convex $f$ setting. For this case, the next 3 steps apply.

Step I: Bounding $\langle \nabla f(X),\, X - X_r^\star\rangle$. The first term in the above expression can be lower bounded using
smoothness and strong convexity of $f$, and involves the construction of a feasible point $X^+$. We construct such a
feasible point by modifying the current update to one with a bigger step size $\hat\eta$.

Lemma 6.2. Let $f$ be an $M$-smooth and $(m, r)$-restricted strongly convex function with optimum point $X^\star$.
Moreover, let $X_r^\star$ be the best rank-$r$ approximation of $X^\star$. Let $X = UU^\top$. Then,

$$\langle \nabla f(X),\, X - X_r^\star\rangle \ge \tfrac{18\hat\eta}{10}\|\nabla f(X)U\|_F^2 + \tfrac{m}{2}\|X - X_r^\star\|_F^2 - \tfrac{M}{2}\|X^\star - X_r^\star\|_F^2,$$

where $\hat\eta = \frac{1}{16\left(M\|X\|_2 + \|\nabla f(X)Q_U Q_U^\top\|_2\right)} \ge \tfrac{5}{6}\eta$, by Lemma A.5.

The proof of this lemma is provided in Section B.1.


Step II: Bounding $\langle \nabla f(X),\, \Delta\Delta^\top\rangle$. The second term in equation (13) can actually be negative. Hence, we lower
bound it using our initialization assumptions. Intuitively, the second term is smaller than the first one as it
scales as $\mathrm{Dist}(U, U_r^\star)^2$, while the first term scales as $\mathrm{Dist}(U, U_r^\star)$.

Lemma 6.3. Let $f$ be $M$-smooth and $(m, r)$-restricted strongly convex. Then, under assumptions (A2) and
(A4), the following bound holds true:

$$\langle \nabla f(X),\, \Delta\Delta^\top\rangle \ge -\tfrac{2\hat\eta}{25}\|\nabla f(X)U\|_F^2 - \left(\tfrac{m\,\sigma_r(X^\star)}{20} + M\|X^\star - X_r^\star\|_F\right)\cdot\mathrm{Dist}(U, U_r^\star)^2.$$

The proof of this lemma can be found in Section B.2.


Step III: Combining the bounds in equation (13). For a detailed description, see Section B.3.

• Smooth $f$ setting. For this case, the next 3 steps apply.

Step I: Bounding $\langle \nabla f(X),\, X - X^\star\rangle$. Similar to the strongly convex case, one can obtain a lower bound on
$\langle \nabla f(X),\, X - X^\star\rangle$, according to the following lemma:

Lemma 6.4. Let $f$ be an $M$-smooth convex function with optimum point $X^\star$. Then, under the assumption that
$f(X^+) \ge f(X^\star)$, the following holds:

$$\langle \nabla f(X),\, X - X^\star\rangle \ge \tfrac{18\hat\eta}{10}\|\nabla f(X)U\|_F^2.$$

The proof of this lemma can be found in Appendix C.

Step II: Bounding $\langle \nabla f(X),\, \Delta\Delta^\top\rangle$. Here, we follow a different path in providing a lower bound for $\langle \nabla f(X),\, \Delta\Delta^\top\rangle$.
The following lemma provides such a lower bound.

Lemma 6.5. Let $X = UU^\top$ and define $\Delta := U - U_r^\star R_U^\star$. Let $f(X^+) \ge f(X_r^\star)$. Then, for $\mathrm{Dist}(U, U_r^\star) \le
\rho\,\sigma_r(U_r^\star)$, where $\rho = \tfrac{1}{100}\tfrac{\sigma_r(X^\star)}{\sigma_1(X^\star)}$, and $f$ being an $M$-smooth convex function, the following lower bound holds:

<V/(X),AA t ) > • yig ■ |(V/(X),X — X*)|. 

^ 2 100 

The proof of this lemma can be found in Appendix C.

Step III: Combining the bounds in equation (13). For a detailed description, see Section C and Lemma C.4. □
6.2 Proof of linear convergence (Theorem 4.2)

The proof of this theorem involves showing that the potential function $\mathrm{Dist}(U, U_r^\star)$ is decreasing per iteration
(up to the approximation error $\|X^\star - X_r^\star\|_F$), using the descent Lemma 6.1. Using the algorithm's update rule, we
obtain

$$\begin{aligned}
\mathrm{Dist}(U^+, U_r^\star)^2 &= \min_{R \in \mathcal{O}} \|U^+ - U_r^\star R\|_F^2 \\
&\le \|U^+ - U_r^\star R_U^\star\|_F^2 \\
&= \|U^+ - U + U - U_r^\star R_U^\star\|_F^2 \\
&= \|U^+ - U\|_F^2 + \|U - U_r^\star R_U^\star\|_F^2 - 2\langle U^+ - U,\, U_r^\star R_U^\star - U\rangle, \qquad (14)
\end{aligned}$$

which follows by adding and subtracting $U$ and then expanding the squared term.


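Step I: Bounding the term $\langle U - U^+,\, U - U_r^\star R_U^\star\rangle$ in (14). By Lemma 6.1, we can bound the last term on the right
hand side as:

$$\langle \nabla f(X)U,\, U - U_r^\star R_U^\star\rangle \ge \tfrac{2}{3}\eta\,\|\nabla f(X)U\|_F^2 + \tfrac{3m}{20}\cdot\sigma_r(X^\star)\,\mathrm{Dist}(U, U_r^\star)^2 - \tfrac{M}{4}\|X^\star - X_r^\star\|_F^2.$$

Furthermore, we can substitute $U^+$ in the first term to obtain $\|U^+ - U\|_F^2 = \eta^2\|\nabla f(X)U\|_F^2$.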

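Step II: Combining bounds into (14). Combining the above two relations, (14) becomes:

$$\begin{aligned}
\mathrm{Dist}(U^+, U_r^\star)^2 &\le \eta^2\|\nabla f(X)U\|_F^2 + \|U - U_r^\star R_U^\star\|_F^2 \\
&\quad - 2\eta\left(\tfrac{2}{3}\eta\,\|\nabla f(X)U\|_F^2 + \tfrac{3m}{20}\cdot\sigma_r(X^\star)\,\mathrm{Dist}(U, U_r^\star)^2 - \tfrac{M}{4}\|X^\star - X_r^\star\|_F^2\right) \\
&= \|U - U_r^\star R_U^\star\|_F^2 + \tfrac{\eta M}{2}\|X^\star - X_r^\star\|_F^2 + \eta^2\left(\|\nabla f(X)U\|_F^2 - \tfrac{4}{3}\|\nabla f(X)U\|_F^2\right) \\
&\quad - \tfrac{3m\eta}{10}\cdot\sigma_r(X^\star)\,\mathrm{Dist}(U, U_r^\star)^2 \\
&\overset{(i)}{\le} \|U - U_r^\star R_U^\star\|_F^2 + \tfrac{\eta M}{2}\|X^\star - X_r^\star\|_F^2 - \tfrac{3m\eta}{10}\cdot\sigma_r(X^\star)\,\mathrm{Dist}(U, U_r^\star)^2 \\
&= \left(1 - \tfrac{3m\eta}{10}\cdot\sigma_r(X^\star)\right)\cdot\mathrm{Dist}(U, U_r^\star)^2 + \tfrac{\eta M}{2}\|X^\star - X_r^\star\|_F^2 \\
&\overset{(ii)}{\le} \left(1 - \tfrac{3m\eta^\star}{10}\cdot\tfrac{10}{11}\,\sigma_r(X^\star)\right)\cdot\mathrm{Dist}(U, U_r^\star)^2 + \tfrac{11M\eta^\star}{20}\|X^\star - X_r^\star\|_F^2 \\
&\le \left(1 - \tfrac{3m\eta^\star}{11}\,\sigma_r(X^\star)\right)\cdot\mathrm{Dist}(U, U_r^\star)^2 + \tfrac{11M\eta^\star}{20}\|X^\star - X_r^\star\|_F^2 \\
&\overset{(iii)}{\le} \left(1 - \frac{m\,\sigma_r(X^\star)}{64\left(M\|X^\star\|_2 + \|\nabla f(X^\star)\|_2\right)}\right)\mathrm{Dist}(U, U_r^\star)^2 + \frac{M}{28\left(M\|X^\star\|_2 + \|\nabla f(X^\star)\|_2\right)}\|X^\star - X_r^\star\|_F^2,
\end{aligned}$$

where (i) is due to removing the negative part from the right hand side, (ii) is due to $\tfrac{10}{11}\eta^\star \le \eta \le \tfrac{11}{10}\eta^\star$
by Lemma A.5, and (iii) follows from substituting $\eta^\star = \frac{1}{16\left(M\|X^\star\|_2 + \|\nabla f(X^\star)\|_2\right)}$. This proves the first part of the
theorem.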

Step III: $U^+$ satisfies the initial condition. Now we prove the second part. By the above equation, we have:


Dist (U + , Uf) 2 < (l - ^ • a r (X*)j Dist(17, U*) 2 + 


HMr/ 

20 


IX*-X 


r\\F 


(<) 


< (l - *£ • °r(X*)) (, o') 2 a r (X *) + a 2 (X*) 

(p') 2 a r (X*) (l - • a r (X *) + ■ a r (X*j) 

C o') 2 a r (X *) (l -rsf. a r (X*) + *fa r (X * 


< (, J) 2 <Tt(X *). 

(i) follows from substituting the assumptions on $\mathrm{Dist}(U, U_r^\star)$ and $\|X^\star - X_r^\star\|_F$, and the last inequality is due
to the term in the parenthesis being less than one.
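The per-iteration guarantee of Theorem 4.2 has the scalar form $d^+ \le \alpha d + \beta$, with $d = \mathrm{Dist}(U, U_r^\star)^2$, a contraction factor $\alpha < 1$ and an additive term $\beta$ proportional to $\|X^\star - X_r^\star\|_F^2$. Unrolling it gives $d_k \le \alpha^k d_0 + \beta/(1 - \alpha)$: linear convergence up to an error floor. The sketch below iterates the worst case of this recursion; the numerical values of `alpha`, `beta` and `d0` are hypothetical and are not constants from the paper.

```python
def unroll(d0, alpha, beta, steps):
    """Iterate the worst case d <- alpha * d + beta and return the whole trajectory."""
    traj = [d0]
    for _ in range(steps):
        traj.append(alpha * traj[-1] + beta)
    return traj

alpha, beta, d0 = 0.9, 1e-3, 1.0             # hypothetical values, for illustration only
traj = unroll(d0, alpha, beta, 200)
floor = beta / (1.0 - alpha)                 # fixed point of the recursion
bounds = [alpha ** k * d0 + floor for k in range(len(traj))]
```

When $X^\star$ is exactly rank-$r$, $\beta = 0$ and the floor vanishes, which recovers plain linear convergence.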


6.3 Proof of sublinear convergence (Theorem 4.1) 

Here, we show convergence of Fgd when $f$ is only an $M$-smooth convex function. At iterate $k$, we assume
$f(X^k) > f(X^\star)$; in the opposite case, the bound follows trivially. Recall that the updates of Fgd over the $U$-space
satisfy

$$U^+ = U - \eta\nabla f(X)U.$$
It is easy to verify that $X^+ = U^+(U^+)^\top = X - \eta\nabla f(X)X\Lambda - \eta\Lambda^\top X\nabla f(X)$, where $\Lambda = I - \tfrac{\eta}{2}Q_U Q_U^\top\nabla f(X) \in \mathbb{R}^{n \times n}$. Notice that for step size $\eta$, using Lemma A.5, we get

$$\Lambda \succ 0, \quad \|\Lambda\|_2 \le 1 + \tfrac{1}{32}, \quad\text{and}\quad \sigma_n(\Lambda) \ge 1 - \tfrac{1}{32}. \qquad (15)$$
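The expression for $X^+$ in terms of $\Lambda$ relies on $X Q_U Q_U^\top = X$, since $Q_U Q_U^\top$ projects onto the column space of $U$. A short numerical check follows; the random symmetric `G` standing in for $\nabla f(X)$, the step size value and the seed are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, eta = 6, 2, 0.05
U = rng.standard_normal((n, r))
G = rng.standard_normal((n, n))
G = 0.5 * (G + G.T)                          # symmetric stand-in for grad f(X)

X = U @ U.T
Q, _ = np.linalg.qr(U)                       # orthonormal basis Q_U of the column space of U
Lam = np.eye(n) - 0.5 * eta * Q @ Q.T @ G    # Lambda = I - (eta / 2) Q_U Q_U^T grad f(X)

U_plus = U - eta * G @ U                     # Fgd update
lhs = U_plus @ U_plus.T
rhs = X - eta * G @ X @ Lam - eta * Lam.T @ X @ G
```

Both sides expand to $X - \eta GX - \eta XG + \eta^2 GXG$, so `lhs` and `rhs` coincide up to round-off.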


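Our proof proceeds using the smoothness condition on $f$, at point $X^+$. In particular,

$$\begin{aligned}
f(X^+) &\le f(X) + \langle \nabla f(X),\, X^+ - X\rangle + \tfrac{M}{2}\|X^+ - X\|_F^2 \\
&\overset{(i)}{\le} f(X) - 2\eta\cdot\sigma_n(\Lambda)\cdot\|\nabla f(X)U\|_F^2 + 2M\eta^2\cdot\|\nabla f(X)U\|_F^2\cdot\|X\|_2\cdot\|\Lambda\|_2^2 \\
&\overset{(ii)}{\le} f(X) - \tfrac{31\eta}{16}\cdot\|\nabla f(X)U\|_F^2 + \tfrac{\eta}{7}\cdot\left(\tfrac{33}{32}\right)^2\cdot\|\nabla f(X)U\|_F^2 \\
&\le f(X) - \tfrac{17\eta}{10}\|\nabla f(X)U\|_F^2, \qquad (16)
\end{aligned}$$

where (i) follows from symmetry of $\nabla f(X)$, $X$ and

$$\begin{aligned}
\mathrm{Tr}(\nabla f(X)\nabla f(X)X\Lambda) &= \mathrm{Tr}(\nabla f(X)\nabla f(X)UU^\top) - \tfrac{\eta}{2}\mathrm{Tr}(\nabla f(X)\nabla f(X)UU^\top\nabla f(X)) \\
&\ge \left(1 - \tfrac{\eta}{2}\|Q_U Q_U^\top\nabla f(X)\|_2\right)\|\nabla f(X)U\|_F^2 \\
&\ge \left(1 - \tfrac{1}{32}\right)\|\nabla f(X)U\|_F^2,
\end{aligned}$$

and (ii) is due to (15) and the fact that $\eta \le \frac{1}{16M\|X^0\|_2} \le \frac{1}{14M\|X\|_2}$ (see Lemma A.5).

Hence,

$$f(X^+) - f(X^\star) \le f(X) - f(X^\star) - \tfrac{17\eta}{10}\|\nabla f(X)U\|_F^2. \qquad (17)$$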

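To bound the term $f(X) - f(X^\star)$ on the right hand side of (17), we use standard convexity as follows:

$$\begin{aligned}
f(X) - f(X^\star) &\le \langle \nabla f(X),\, X - X^\star\rangle \\
&\overset{(i)}{=} 2\langle \nabla f(X),\, UU^\top - U_r^\star R_U^\star U^\top\rangle - \langle \nabla f(X),\, UU^\top + U_r^\star R_U^\star (U_r^\star R_U^\star)^\top - 2U_r^\star R_U^\star U^\top\rangle \\
&= 2\langle \nabla f(X)U,\, U - U_r^\star R_U^\star\rangle - \langle \nabla f(X),\, (U - U_r^\star R_U^\star)(U - U_r^\star R_U^\star)^\top\rangle \\
&\overset{(ii)}{=} 2\langle \nabla f(X)U,\, \Delta\rangle - \langle \nabla f(X),\, \Delta\Delta^\top\rangle \\
&\le 2\langle \nabla f(X)U,\, \Delta\rangle + |\langle \nabla f(X),\, \Delta\Delta^\top\rangle| \\
&\overset{(iii)}{\le} 2\cdot\|\nabla f(X)U\|_F\cdot\mathrm{Dist}(U, U_r^\star) + \tfrac{1}{2}\|\nabla f(X)U\|_2\cdot\mathrm{Dist}(U, U_r^\star) \\
&\overset{(iv)}{\le} \tfrac{5}{2}\|\nabla f(X)U\|_F\cdot\mathrm{Dist}(U, U_r^\star), \qquad (18)
\end{aligned}$$

where (i) is due to $X = UU^\top$ and $X^\star = U_r^\star R_U^\star (U_r^\star R_U^\star)^\top$ for an orthonormal matrix $R_U^\star \in \mathbb{R}^{r \times r}$, (ii) is by
$\Delta := U - U_r^\star R_U^\star$, (iii) is due to the Cauchy-Schwarz inequality and Lemma C.1, and (iv) is due to the norm ordering
$\|\cdot\|_2 \le \|\cdot\|_F$.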


From (18), we obtain the following bound:

$$\|\nabla f(X)U\|_F \ge \tfrac{2}{5}\cdot\frac{f(X) - f(X^\star)}{\mathrm{Dist}(U, U_r^\star)}. \qquad (19)$$

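Define $\delta = f(X) - f(X^\star)$ and $\delta^+ = f(X^+) - f(X^\star)$. Moreover, by Lemma C.5, we know that $\mathrm{Dist}(U, U_r^\star) \le
\mathrm{Dist}(U^0, U_r^\star)$ for all iterations of Fgd; thus, we have $\frac{1}{\mathrm{Dist}(U, U_r^\star)} \ge \frac{1}{\mathrm{Dist}(U^0, U_r^\star)}$ for every update $U$. Using the
above definitions and substituting (19) in (17), we obtain the following recursion:

$$\delta^+ \le \delta - \frac{\eta}{5\cdot\mathrm{Dist}(U^0, U_r^\star)^2}\cdot\delta^2 \;\Longrightarrow\; \delta^+ \le \delta\left(1 - \frac{\eta}{5\cdot\mathrm{Dist}(U^0, U_r^\star)^2}\cdot\delta\right),$$

which can be further transformed as:

$$\frac{1}{\delta^+} \ge \frac{1}{\delta} + \frac{\eta}{5\cdot\mathrm{Dist}(U^0, U_r^\star)^2}\cdot\frac{\delta}{\delta^+} \ge \frac{1}{\delta} + \frac{\eta}{5\cdot\mathrm{Dist}(U^0, U_r^\star)^2},$$

since $\delta^+ \le \delta$ from equation (17). Since each $\delta$ and $\delta^+$ correspond to the previous and new estimate in Fgd per
iteration, we can sum up the above inequalities over $k$ iterations to obtain

$$\frac{1}{\delta^k} \ge \frac{1}{\delta^0} + \frac{\eta}{5\cdot\mathrm{Dist}(U^0, U_r^\star)^2}\cdot k;$$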
here, $\delta^k := f(X^k) - f(X^\star)$ and $\delta^0 := f(X^0) - f(X^\star)$. After simple transformations, we finally obtain:$^{13}$

$$f(X^k) - f(X_r^\star) \le \frac{\tfrac{5}{\eta}\cdot\mathrm{Dist}(U^0, U_r^\star)^2}{k + \tfrac{5}{\eta}\cdot\frac{\mathrm{Dist}(U^0, U_r^\star)^2}{f(X^0) - f(X_r^\star)}}.$$


This finishes the proof. 
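The last step of the argument is the standard passage from a recursion of the form $\delta^+ \le \delta - c\,\delta^2$ to a $1/k$ rate, via $1/\delta^+ \ge 1/\delta + c$. The sketch below iterates the slowest decrease allowed by such a recursion and compares it with the bound $1/(1/\delta^0 + ck)$; the constant `c` plays the role of the factor multiplying $\delta^2$ in the recursion above, and its value here, like `delta0`, is hypothetical.

```python
def worst_case_gaps(delta0, c, steps):
    """Iterate delta <- delta - c * delta**2, the slowest decrease the recursion allows."""
    gaps = [delta0]
    for _ in range(steps):
        gaps.append(gaps[-1] - c * gaps[-1] ** 2)
    return gaps

delta0, c = 2.0, 0.1                         # hypothetical values with c * delta0 < 1
gaps = worst_case_gaps(delta0, c, 500)
bounds = [1.0 / (1.0 / delta0 + c * k) for k in range(len(gaps))]
```

Every entry of `gaps` lies below the corresponding entry of `bounds`, which is the sublinear rate in closed form.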


7 Related work 


Convex approaches. A significant volume of work has focused on solving the classic Semi-Definite Programming
(SDP) formulation, where the objective $f$ (as well as any additional convex constraints) is assumed to
be linear. There, interior point methods (IPMs) constitute a popular choice for small- and moderate-sized
problems; see @30. For a comprehensive treatment of this subject, see the excellent survey in [53].

Large-scale SDPs pointed research towards first-order approaches, which are more computationally appealing.
For linear $f$, we note among others the work of US], a provably convergent alternating direction augmented
Lagrangian algorithm, and that of Helmberg and Rendl [36], where they develop an efficient first-order spectral
bundle method for SDPs with the constant trace property; see also [35] for extensions on this line of work. In
both cases, no convergence rate guarantees are provided; see also [53]. For completeness, we also mention the
work of [32] [3T] [58] [ZQl on second-order methods that take advantage of data sparsity in order to handle large
SDPs in a more efficient way. However, it turns out that the amount of computation required per iteration is
comparable to that of log-barrier IPMs [53].

Standard SDPs have also found application in the field of combinatorial optimization; there, in most cases,
even a rough approximation to the discrete problem, via SDP, is sufficiently accurate and more computationally
affordable than exhaustive combinatorial algorithms. Goemans and Williamson [32] were the first to propose
the use of SDPs in approximating graph Max Cut, where a near-optimum solution can be found in polynomial
time. g5] propose an alternative approach for solving Max Cut and Graph Coloring instances, where SDPs
are transformed into eigenvalue problems. Then, power method iterations lead to $\epsilon$-approximate solutions;
however, the resulting running-time dependence on $\epsilon$ is worse, compared to standard IPMs. Arora, Hazan
and Kale in [4] derive an algorithm to approximate SDPs, as a hybrid of the Multiplicative Weights Update
method and of ideas originating from an ellipsoid variant [Z3], improving upon existing algorithms for graph
partitioning, computational biology and metric embedding problems.$^{14}$

Extending to non-linear convex $f$ cases, [62 ( 163 ] have shown how IPMs can be generalized to solve instances
of (1), via the notion of self-concordance; see also E1113 for a more recent line of work. Within the class of
first-order methods, approaches for nonlinear convex $f$ include, among others, projected and proximal gradient
descent methods 23 EZ1 43 >, (smoothed) dual ascent methods [6T|, as well as Frank-Wolfe algorithm variants
[39]. Note that all these schemes often require heavy calculations, such as eigenvalue decompositions, to compute
the updates (often, to remain within the feasible set).


Burer & Monteiro factorization and related work. Burer and Monteiro P~5i n6 popularized the idea of
solving classic SDPs by representing the solution as a product of two factor matrices. The main idea in such 
representation is to remove the positive semi-definite constraint by directly embedding it into the objective. 
While the problem becomes non-convex, Burer and Monteiro propose a method-of-multiplier type of algorithm 
which iteratively updates the factors in an alternating fashion. For linear objective /, they establish convergence 
guarantees to the optimum but do not provide convergence rates. 

For generic smooth convex functions, Hazan in [33] proposes the SparseApproxSDP algorithm,$^{15}$ a generalization
of the Frank-Wolfe algorithm for the vector case [25], where putative solutions are refined by rank-1 approximations
of the gradient. At the $r$-th iteration, SparseApproxSDP is guaranteed to compute a $\tfrac{1}{r}$-approximate
solution, with rank at most $r$, i.e., achieves a sublinear $O\left(\tfrac{1}{\epsilon}\right)$ convergence rate. However, depending on $\epsilon$,


13 One can further obtain a bound on the right hand side that depends on $\eta^\star = \frac{1}{16\left(M\|X^\star\|_2 + \|\nabla f(X^\star)\|_2\right)}$. By Lemma A.5, we
know $\eta \ge \tfrac{10}{11}\eta^\star$. Thus, the current proof leads to the bound:

f(X k ) - f(Xj) < 


^ - Dist {U°,Uj ) 2 

h . e DisT(t/°,t/;) 2 ' 
V* ' /(XO)-/(X*) 


14 The algorithm in [4] shows significant computational gains over standard IPMs per iteration, due to requiring only a power 
method calculation per iteration (versus a Cholesky factorization per iteration, in the latter case). However, the polynomial 
dependence on the accuracy parameter $\tfrac{1}{\epsilon}$ is worse, compared to IPMs. Improvements upon this matter can be found in [5] where
a primal-dual Multiplicative Weights Update scheme is proposed. 

15 Sparsity here corresponds to low-rankness of the solution, as in the Cholesky factorization representation. Moreover, inspired 
by Quantum State Tomography applications pQ, SparseApproxSDP can also handle constant trace constraints, in addition to 
PSD ones. 
SparseApproxSDP is not guaranteed to return a low rank solution, unlike Fgd. Application of these ideas in
machine learning tasks can be found in m. Based on the SparseApproxSDP algorithm, m further introduces
“de-bias” steps in order to optimize parameters in SparseApproxSDP and do local refinements of putative
solutions via L-BFGS steps. Nevertheless, the resulting convergence rate is still sublinear.$^{16}$

Specialized algorithms - for objectives beyond the linear case - that utilize such factorization include matrix 
completion /sensing solvers 32] [B5j [52] ITT] , non-negative matrix factorization schemes [52], phase retrieval 
methods [61, T5j and sparse PCA algorithms m Most of these results guarantee linear convergence for various 
algorithms on the factored space starting from a “good” initialization. They also present a simple spectral 
method to compute such an initialization. For the matrix completion /sensing setting, [66] have shown that 
stochastic gradient descent achieves global convergence at a sublinear rate. Note that these results only apply 
to quadratic loss objectives and not to generic convex functions /□in consider the problem of computing 
the matrix square-root of a PSD matrix via gradient descent on the factored space: in this case, the objective 
/ boils down to minimizing the standard squared Euclidean norm distance between two matrices. Surprisingly, 
the authors show that, given an initial point that is well-conditioned, the proposed scheme is guaranteed to find 
an e-accurate solution with linear convergence rate. 

[23] propose a first-order optimization framework for the problem (1), where the same parametrization
technique is used to efficiently accommodate the PSD constraint.$^{18}$ Moreover, the proposed algorithmic solution
can accommodate extra constraints on $X$.$^{19}$ The set of assumptions listed in [23] include, apart from $X^\star$-faithfulness,
local descent, local Lipschitz and local smoothness conditions in the factored space. E.g., the local
descent condition can be established if $g(U) := f(UU^\top)$ is locally strongly convex and $\nabla g(\cdot)$ at an optimum
point vanishes. They also require bounded gradients, as their step size doesn't account for the modified curvature
of $f(UU^\top)$.$^{20}$ These conditions are less standard than the global assumptions of the current work and one
needs to validate that they are satisfied for each problem separately. [23] presents some applications where these
conditions are indeed satisfied. Their results are of the same flavor as ours: under such proper assumptions,
one can prove local convergence with $O(1/\epsilon)$ or $O(\log(1/\epsilon))$ rate, even for $f$ instances that fail to be locally
convex.

Finally, for completeness, we also mention optimization over the Grassmannian manifold that admits tailored
solvers [25]; see 03 EH DU EH E2] for applications in matrix completion and references therein. [35] presents
a second-order method for (1), based on manifold optimization over the set of all equivalence class O. The
proposed algorithm can additionally accommodate constraints and enjoys monotonic decrease of the objective
function (in contrast to [15] [16]), featuring quadratic local convergence. In practice, the per-iteration complexity
is dominated by the extraction of the eigenvector, corresponding to the smallest eigenvalue, of an $n \times n$ matrix,
and only when the current estimate of rank satisfies some conditions.

Table 1 summarizes the comparison of the most relevant work to ours, for the case of matrix factorization
techniques.


Reference    Conv. rate                               Initialization           Output rank
El           $1/\epsilon$ (Smooth $f$)                $X^0 = 0$                $1/\epsilon$
m            $1/\epsilon$ (Smooth $f$)                $X^0 = 0$                $1/\epsilon$
[23]         $1/\epsilon$ (Local Asm.)                Application dependent    $r$
m            $\log(1/\epsilon)$ (Local Asm.)          Application dependent    $r$
This work    $1/\epsilon$ (Smooth $f$)                SVD / top-$r$            $r$
This work    $\log(1/\epsilon)$ (Smooth, RSC $f$)     SVD / top-$r$            $r$


Table 1: Summary of selected results on solving variants of (1) via matrix factorization. “Conv. rate” describes
the number of iterations required to achieve $\epsilon$ accuracy. “Initialization” describes the process for starting point
computation. “SVD” stands for singular value decomposition and “top-$r$” denotes that a rank-$r$ decomposition
is computed. For the case of [23], “Local Asm.” refers to specific assumptions made on the $U$-space; we refer the
reader to the footnote for a short description. “Output rank” denotes the maximum rank of the solution returned
for $\epsilon$-accuracy.


16 For running time comparisons with Fgd see Sections [Cl and [H| 

17 We recently became aware of the extension of the work ED for the non-square case $X = UV^\top$.

18 In this work, the authors further assume orthogonality of columns in U. 

19 However, additional constraints should satisfy the $X^\star$-faithfulness property: a constraint set on $U$, say $\mathcal{U}$, is faithful if for each
$U \in \mathcal{U}$ that is within some bounded radius from the optimal point, we are guaranteed that the closest (in the Euclidean sense) rotation
of $U^\star$ lies within $\mathcal{U}$.

20 One can define non-trivial conditions on the original space; we refer the reader to m
8 Conclusion


In this paper, we focus on how to efficiently minimize a convex function $f$ over the positive semi-definite cone.
Inspired by the seminal work nmn, we drop convexity by factorizing the optimization variable $X = UU^\top$
and show that factored gradient descent with a non-trivial step size selection results in linear convergence when
$f$ is smooth and (restricted) strongly convex, even though the problem is now non-convex. In the case where $f$
is only smooth, only a sublinear rate is guaranteed. In addition, we present initialization schemes that use only
first-order information and are guaranteed to find a starting point with small relative distance from the optimum.

There are many possible directions for future work, extending the idea of using non-convex formulations for
semi-definite optimization. Showing convergence under weaker initialization conditions, or without any initialization
requirement, is definitely of great interest. Another interesting direction is to improve the convergence
rates presented in this work by using acceleration techniques and, thus, extend ideas used in the case of convex
gradient descent m-. Finally, it would be valuable to see how the techniques presented in this paper can be
generalized to other standard algorithms like stochastic gradient descent and coordinate descent.

Furthermore, we identify applications, such as sparse PCA USE], that require non-smooth constraints on 
the factors U. That being said, an extension of this work to proximal techniques for the non-convex case is a 
very interesting future research direction. 
A Supporting lemmata

Lemma A.1 (Hoffman, Wielandt [9]). Let $A$ and $B$ be two PSD $n \times n$ matrices. Also let $A$ be full rank. Then,

$$\mathrm{Tr}(AB) \ge \sigma_{\min}(A)\,\mathrm{Tr}(B). \qquad (20)$$
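As a quick numerical illustration of (20), the following sketch draws a full-rank PSD matrix `A` and a low-rank PSD matrix `B` and compares the two sides. The construction of the matrices and the seed are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
Ra = rng.standard_normal((n, n))
Rb = rng.standard_normal((n, 2))
A = Ra @ Ra.T + 0.1 * np.eye(n)              # full-rank PSD matrix
B = Rb @ Rb.T                                # rank-2 PSD matrix

lhs = np.trace(A @ B)
# For a PSD matrix the smallest singular value equals the smallest eigenvalue.
rhs = np.linalg.eigvalsh(A)[0] * np.trace(B)
```

The inequality `lhs >= rhs` holds because $A - \sigma_{\min}(A) I$ is PSD and the trace of a product of two PSD matrices is non-negative.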

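The following lemma shows that Dist, in the factor $U$ space, upper bounds the Frobenius norm distance in
the matrix $X$ space.

Lemma A.2. Let $X = UU^\top$ and $X_r^\star = U_r^\star U_r^{\star\top}$ be two $n \times n$ rank-$r$ PSD matrices. Let $\mathrm{Dist}(U, U_r^\star) = \|U - U_r^\star R_U^\star\|_F \le \rho\,\sigma_r(U_r^\star)$,
for some rotation matrix $R_U^\star$ and constant $\rho > 0$. Then,

$$\|X - X_r^\star\|_F \le (2 + \rho)\rho\cdot\|U_r^\star\|_2\cdot\sigma_r(U_r^\star).$$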

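Proof. By substituting $X = UU^\top$ and $X_r^\star = U_r^\star U_r^{\star\top}$ in $\|X - X_r^\star\|_F$, we have:

$$\begin{aligned}
\|X - X_r^\star\|_F &= \|UU^\top - U_r^\star U_r^{\star\top}\|_F \\
&\overset{(i)}{=} \|UU^\top - U_r^\star R_U^\star U^\top + U_r^\star R_U^\star U^\top - U_r^\star R_U^\star (U_r^\star R_U^\star)^\top\|_F \\
&\overset{(ii)}{\le} \mathrm{Dist}(U, U_r^\star)\cdot\|U\|_2 + \mathrm{Dist}(U, U_r^\star)\cdot\|U_r^\star\|_2 \\
&\overset{(iii)}{\le} (1 + \rho)\|U_r^\star\|_2\cdot\mathrm{Dist}(U, U_r^\star) + \mathrm{Dist}(U, U_r^\star)\cdot\|U_r^\star\|_2 \\
&= (2 + \rho)\cdot\mathrm{Dist}(U, U_r^\star)\cdot\|U_r^\star\|_2 \\
&\overset{(iv)}{\le} (2 + \rho)\rho\cdot\|U_r^\star\|_2\cdot\sigma_r(U_r^\star),
\end{aligned}$$

where (i) is due to the orthogonality $R_U^{\star\top} R_U^\star = I_{r \times r}$, (ii) is due to the triangle inequality, the Cauchy-Schwarz
inequality and the fact that the spectral norm is invariant w.r.t. orthogonal transformations, and (iii) is due to the
following sequence of inequalities, based on the hypothesis of the lemma:

$$\|U\|_2 - \|U_r^\star\|_2 \le \|U - U_r^\star R_U^\star\|_2 \le \mathrm{Dist}(U, U_r^\star) \le \rho\,\sigma_r(U_r^\star),$$

and thus $\|U\|_2 \le (1 + \rho)\cdot\|U_r^\star\|_2$. The final inequality (iv) follows from the hypothesis of the lemma. □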
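The bound of Lemma A.2 can be checked numerically by taking $\rho$ to be the smallest admissible value, $\rho = \mathrm{Dist}(U, U_r^\star)/\sigma_r(U_r^\star)$. In the sketch below, the helper `dist`, the perturbation size and the seed are assumptions made for the example.

```python
import numpy as np

def dist(U, V):
    """Dist(U, V) = min over orthogonal R of ||U - V R||_F (orthogonal Procrustes)."""
    P, _, Qt = np.linalg.svd(V.T @ U)
    return np.linalg.norm(U - V @ (P @ Qt), "fro")

rng = np.random.default_rng(5)
n, r = 8, 3
U_star = rng.standard_normal((n, r))
U = U_star + 0.05 * rng.standard_normal((n, r))   # a nearby factor

s = np.linalg.svd(U_star, compute_uv=False)       # singular values, in descending order
rho = dist(U, U_star) / s[-1]                     # smallest rho with Dist <= rho * sigma_r(U*)
lhs = np.linalg.norm(U @ U.T - U_star @ U_star.T, "fro")
rhs = (2.0 + rho) * rho * s[0] * s[-1]
```

Here `s[0]` is $\|U_r^\star\|_2$ and `s[-1]` is $\sigma_r(U_r^\star)$, so `lhs <= rhs` is exactly the statement of the lemma for this pair of factors.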

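The following lemma connects the spectrum of $U$ to $U_r^\star$ under the initialization assumptions.

Lemma A.3. Let $U$ and $U_r^\star$ be $n \times r$ matrices such that $\mathrm{Dist}(U, U_r^\star) \le \rho\,\sigma_r(U_r^\star)$, for $\rho = \tfrac{1}{100}\tfrac{\sigma_r(X^\star)}{\sigma_1(X^\star)}$. In addition,
define $X_r^\star = U_r^\star U_r^{\star\top}$. Then, the following bounds hold true:

$$(1 - \tfrac{1}{100})\,\sigma_1(U_r^\star) \le \sigma_1(U) \le (1 + \tfrac{1}{100})\,\sigma_1(U_r^\star),$$

$$(1 - \tfrac{1}{100})\,\sigma_r(U_r^\star) \le \sigma_r(U) \le (1 + \tfrac{1}{100})\,\sigma_r(U_r^\star).$$

Moreover, by definition of $\tau(V) := \frac{\sigma_1(V)}{\sigma_r(V)}$ for some matrix $V$, we also observe:

$$\tau(U) \le \tfrac{101}{99}\cdot\tau(U_r^\star) \quad\text{and}\quad \tau(X) \le \left(\tfrac{101}{99}\right)^2\cdot\tau(X_r^\star).$$

Proof. Using the norm ordering $\|\cdot\|_2 \le \|\cdot\|_F$ and Weyl's inequality for perturbation of singular values
(Theorem 3.3.16 [37]) we get,

$$|\sigma_i(U) - \sigma_i(U_r^\star)| \le \tfrac{1}{100}\tfrac{\sigma_r(X^\star)}{\sigma_1(X^\star)}\,\sigma_r(U_r^\star), \qquad 1 \le i \le r.$$

Then, the first two inequalities of the lemma follow by using the triangle inequality and the above bound. For the
last two inequalities, it is easy to derive bounds on condition numbers by combining the first two inequalities.
Viz.,

$$\tau(U) = \frac{\sigma_1(U)}{\sigma_r(U)} \le \frac{1 + \tfrac{1}{100}}{1 - \tfrac{1}{100}}\cdot\frac{\sigma_1(U_r^\star)}{\sigma_r(U_r^\star)} = \tfrac{101}{99}\cdot\tau(U_r^\star),$$

while the last bound can be easily derived since $\tau(U_r^\star) = \sqrt{\tau(X_r^\star)}$. □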
The following lemma shows that Dist, in the factor $U$ space, lower bounds the Frobenius norm distance in
the matrix $X$ space.

Lemma A.4. Let $X = UU^\top$ and $X_r^\star = U_r^\star U_r^{\star\top}$ be two rank-$r$ PSD matrices. Let $\mathrm{Dist}(U, U_r^\star) \le \rho\,\sigma_r(U_r^\star)$, for
$\rho = \tfrac{1}{100}\tfrac{\sigma_r(X^\star)}{\sigma_1(X^\star)}$. Then,


\\X ~ KWl > ^ !1 Dist(17, Uf) 2 . 

Proof. This proof largely follows the arguments for Lemma 5.4 in m, from which we know that

$$\|X - X_r^\star\|_F^2 \ge 2(\sqrt{2} - 1)\,\sigma_r(X^\star)\,\mathrm{Dist}(U, U_r^\star)^2.$$

Hence, ||A — X*\\ 2 F > 3 g> h Y ) Dist(17, Uf) 2 , for the given value of p. 

The following lemma shows equivalence between various step sizes used in the proofs. 


( 21 ) 

□ 


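Lemma A.5. Let $X^0 = U^0 U^{0\top}$ and $X = UU^\top$ be two $n \times n$ rank-$r$ PSD matrices such that $\mathrm{Dist}(U, U_r^\star) \le
\mathrm{Dist}(U^0, U_r^\star) \le \rho\,\sigma_r(U_r^\star)$, where $\rho = \tfrac{1}{100}\tfrac{\sigma_r(X^\star)}{\sigma_1(X^\star)}$. Define the following step sizes:

(i) $\eta = \frac{1}{16\left(M\|X^0\|_2 + \|\nabla f(X^0)\|_2\right)}$,

(ii) $\hat\eta = \frac{1}{16\left(M\|X\|_2 + \|\nabla f(X)Q_U Q_U^\top\|_2\right)}$, and

(iii) $\eta^\star = \frac{1}{16\left(M\|X^\star\|_2 + \|\nabla f(X^\star)\|_2\right)}$.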

Then, rj > |r? holds. Moreover, assuming ||A* — AT*|| jr < ar foo apx*] ’ f°ll° w i n 9 inequalities hold: 


&*<*<&* 


A.3 


we have, 9S /ioo ||AT *|| 2 < ||X°lL < 


2 — 


Proof. By the assumptions of this lemma and based on Lemma 
103 /ioo ||AT*|| 2 ; similarly 9S /ioo ||AT *|| 2 < ||AT || 2 < 103 /ioo ||AT*|| 2 . Hence, we can combine these two set of in¬ 
equalities to obtain bounds between X° and A, as follows: 


mil* 0 ll 2 <imi2< w\\ x 


I 2 ‘ 


To prove the desiderata, we show the relationship between the gradient terms ||X f(X)QuQjj\\ 2 , ||V/(AT 0 )|| 2 
and || V/(AT*)|| 2 - In particular, for the case fj> we have: 

||V/(AT)Q c/ Q ^|| 2 < ||V/(AT )|| 2 < ||V/(A) - V/(A °)|| 2 + ||V/(AT °)|| 2 

(ii) 

< M\\X — X°|| F + ||V/(AT 0 )|| 2 

( < } iWf||AT - AT*||^ + M\\X° - Xf\\ F + ||V/(AT °)|| 2 

( iv ) 

< 2M(2 + p)p\\Uf\\ 2 ■ a r (Uf) + ||V/(AT 0 )|| 2 

< 2 M ■ (2 + jig) • Tfjn||AT *|| 2 + || V/(AT 0 )|| 2 


< ill^°|| 2 


100 / 100 
l|V/(AT°) 


where (z) follows from the triangle inequality, (ii) is due to the smoothness assumption, (Hi) is due to the 
triangle inequality, (iv) follows by applying Lemma A.2 on the first two terms on the right hand side and, (v) 
is due to the fact ||L /*|| 2 ■ a r (U*) < ||AT*|| 2 and by substituting p = < Ag_ Last inequality follows 

from 98 /ioo ||AT *|| 2 < ||AT 0 1| Hence, using the above bounds in step size selection, we get 


77 = 


> 


16(M||X|| 2 +|| V/(X)Q(/Q J|| 2 ) 16 ^6M||_yo|| 2 4.||v/(X 0 )|| 2 


> h, 


where (i) is based also on the bound ||A|| 2 < -^||A°|| 2 . 
Similarly, we show the bound $\tfrac{10}{11}\eta^\star \le \eta \le \tfrac{11}{10}\eta^\star$. First observe that,

l|V/(A°)|| 2 < IIV/(A r *) - V/(A°)|| 2 + ||V/(A r *)|| 2 
<M||A r *-A°|| F + ||V/(A r *)|| 2 

< M\\X* - A°|| F + M\\X* - A'*||f + ||V/(A*)|| 2 

< M {2 + p)p ■ \\u ;\\ 2 ■ °r{U* r ) + Ma r (X*) + ||V/(A*)|| 2 

< ^M||A *|| 2 + ||V/(A*)|| 2 . 

Combining the above bound with 98 /ioo || A '*|| 2 < ||A '°|| 2 < 103 /ioo || A *|| 2 gives, rj > ^ 77 *. Similarly we can 
show the other bounds. □ 


B Main lemmas for the restricted strong convex case

In this section, we present proofs for the main lemmas used in the proof of Theorem 4.2, in Section 6.


B.1 Proof of Lemma 6.2


Here, we prove the existence of a non-trivial lower bound for $\langle \nabla f(X),\, X - X_r^\star\rangle$. Our proof differs from the
standard convex gradient descent proof (see [BOl ), as we need to analyze updates without any projections. Our
proof technique constructs a pseudo-iterate to obtain a bigger lower bound than the error term in Lemma 6.3.
Here, the nature of the step size plays a key role in achieving the bound.

Let us abuse our notation and define $U^+ = U - \hat\eta\nabla f(X)U$ and $X^+ = U^+ U^{+\top}$. Observe that we use the
surrogate step size $\hat\eta$, which, according to Lemma A.5, satisfies $\hat\eta \ge \tfrac{5}{6}\eta$. By smoothness of $f$, we get:

$$\begin{aligned}
f(X) &\ge f(X^+) - \langle \nabla f(X),\, X^+ - X\rangle - \tfrac{M}{2}\|X^+ - X\|_F^2 \\
&\overset{(i)}{\ge} f(X^\star) - \langle \nabla f(X),\, X^+ - X\rangle - \tfrac{M}{2}\|X^+ - X\|_F^2, \qquad (22)
\end{aligned}$$

where (i) follows from optimality of $X^\star$ and since $X^+$ is a feasible point ($X^+ \succeq 0$) for problem (1). Further,
note that $X_r^\star$ is a PSD feasible point. By smoothness of $f$, we also get


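$$\begin{aligned}
f(X_r^\star) &\le f(X^\star) + \langle \nabla f(X^\star),\, X_r^\star - X^\star\rangle + \tfrac{M}{2}\|X_r^\star - X^\star\|_F^2 \\
&\overset{(i)}{=} f(X^\star) + \tfrac{M}{2}\|X_r^\star - X^\star\|_F^2, \qquad (23)
\end{aligned}$$

where (i) is due to KKT conditions JT2]: since $\nabla f(X^\star)$ is orthogonal to $X^\star$, it is also orthogonal to the $n - r$
bottom eigenvectors of $X^\star$. Viz., $\langle \nabla f(X^\star),\, X_r^\star - X^\star\rangle = 0$. Finally, since $\mathrm{rank}(X_r^\star) = r$, by the $(m, r)$-restricted
strong convexity of $f$, we get,

$$f(X_r^\star) \ge f(X) + \langle \nabla f(X),\, X_r^\star - X\rangle + \tfrac{m}{2}\|X_r^\star - X\|_F^2. \qquad (24)$$

Combining equations (22), (23), and (24), we obtain:

$$\langle \nabla f(X),\, X - X_r^\star\rangle \ge \langle \nabla f(X),\, X - X^+\rangle - \tfrac{M}{2}\|X^+ - X\|_F^2 + \tfrac{m}{2}\|X_r^\star - X\|_F^2 - \tfrac{M}{2}\|X^\star - X_r^\star\|_F^2. \qquad (25)$$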

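It is easy to verify that $X^+ = X - \hat\eta\nabla f(X)X\Lambda - \hat\eta\Lambda^\top X\nabla f(X)$, where $\Lambda = I - \tfrac{\hat\eta}{2}Q_U Q_U^\top\nabla f(X) \in \mathbb{R}^{n \times n}$. Notice
that, for step size $\hat\eta$, we have

$$\Lambda \succ 0, \quad \|\Lambda\|_2 \le 1 + \tfrac{1}{32}, \quad\text{and}\quad \sigma_n(\Lambda) \ge 1 - \tfrac{1}{32}.$$

Substituting the above in (25), we obtain:

$$\begin{aligned}
\langle \nabla f(X),\, X - X_r^\star\rangle &- \tfrac{m}{2}\|X_r^\star - X\|_F^2 + \tfrac{M}{2}\|X^\star - X_r^\star\|_F^2 \\
&\overset{(i)}{\ge} 2\hat\eta\,\langle \nabla f(X),\, \nabla f(X)X\Lambda\rangle - \tfrac{M}{2}\|2\hat\eta\nabla f(X)X\Lambda\|_F^2 \\
&= 2\hat\eta\,\mathrm{Tr}(\nabla f(X)\nabla f(X)X\Lambda) - 2M\hat\eta^2\|\nabla f(X)X\Lambda\|_F^2 \\
&\overset{(ii)}{\ge} 2\hat\eta\,\mathrm{Tr}(\nabla f(X)\nabla f(X)X)\cdot\sigma_n(\Lambda) - 2M\hat\eta^2\|\nabla f(X)U\|_F^2\,\|U\|_2^2\,\|\Lambda\|_2^2 \\
&\ge \tfrac{31\hat\eta}{16}\|\nabla f(X)U\|_F^2 - 2M\hat\eta^2\cdot\left(\tfrac{33}{32}\right)^2\cdot\|\nabla f(X)U\|_F^2\,\|U\|_2^2 \\
&= \tfrac{31\hat\eta}{16}\|\nabla f(X)U\|_F^2\left(1 - 2M\hat\eta\left(\tfrac{33}{32}\right)^2\cdot\tfrac{16}{31}\cdot\|X\|_2\right) \\
&\overset{(iii)}{\ge} \tfrac{18\hat\eta}{10}\|\nabla f(X)U\|_F^2,
\end{aligned}$$

where (i) follows from symmetry of $\nabla f(X)$ and $X$, and (ii) follows from

$$\begin{aligned}
\mathrm{Tr}(\nabla f(X)\nabla f(X)X\Lambda) &= \mathrm{Tr}(\nabla f(X)\nabla f(X)UU^\top) - \tfrac{\hat\eta}{2}\mathrm{Tr}(\nabla f(X)\nabla f(X)UU^\top\nabla f(X)) \\
&\ge \left(1 - \tfrac{\hat\eta}{2}\|Q_U Q_U^\top\nabla f(X)\|_2\right)\|\nabla f(X)U\|_F^2 \\
&\ge \left(1 - \tfrac{1}{32}\right)\|\nabla f(X)U\|_F^2.
\end{aligned}$$

Finally, (iii) follows by observing that $\hat\eta \le \frac{1}{16M\|X\|_2}$. Thus, we achieve the desiderata:

$$\langle \nabla f(X),\, X - X_r^\star\rangle \ge \tfrac{18\hat\eta}{10}\|\nabla f(X)U\|_F^2 + \tfrac{m}{2}\|X_r^\star - X\|_F^2 - \tfrac{M}{2}\|X^\star - X_r^\star\|_F^2.$$

This completes the proof.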


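B.2 Proof of Lemma 6.3

We lower bound $\langle \nabla f(X), \Delta \Delta^\top \rangle$ as follows:

$$
\begin{aligned}
\langle \nabla f(X), \Delta \Delta^\top \rangle &\overset{(i)}{=} \langle Q_\Delta Q_\Delta^\top \nabla f(X), \Delta \Delta^\top \rangle \\
&\ge - \left| \mathrm{Tr}(Q_\Delta Q_\Delta^\top \nabla f(X) \Delta \Delta^\top) \right| \\
&\overset{(ii)}{\ge} - \|Q_\Delta Q_\Delta^\top \nabla f(X)\|_2 \, \mathrm{Tr}(\Delta \Delta^\top) \\
&\overset{(iii)}{\ge} - \left( \|Q_U Q_U^\top \nabla f(X)\|_2 + \|Q_{U_r^*} Q_{U_r^*}^\top \nabla f(X)\|_2 \right) \mathrm{Dist}(U, U_r^*)^2. \qquad (26)
\end{aligned}
$$

Note that (i) follows from the fact $\Delta = Q_\Delta Q_\Delta^\top \Delta$ and (ii) follows from $|\mathrm{Tr}(AB)| \le \|A\|_2 \mathrm{Tr}(B)$, for PSD matrix
$B$ (Von Neumann's trace inequality [55]). For the transformation in (iii), we use the fact that the column
space of $\Delta$, $\mathrm{Span}(\Delta)$, is a subset of $\mathrm{Span}(U \cup U_r^*)$, as $\Delta$ is a linear combination of $U$ and $U_r^* R_U^*$.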
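The trace inequality used in step (ii) can be sanity-checked numerically. The NumPy sketch below draws a random symmetric matrix (standing in for the projected gradient) and a random PSD matrix (standing in for $\Delta\Delta^\top$) and evaluates the gap $\|A\|_2 \mathrm{Tr}(B) - |\mathrm{Tr}(AB)|$. The helper name `trace_inequality_gap` and the random test matrices are illustrative assumptions, not part of the proof.

```python
import numpy as np

def trace_inequality_gap(A, B):
    """Return ||A||_2 * Tr(B) - |Tr(A B)|; nonnegative whenever B is PSD."""
    spectral_norm = np.linalg.norm(A, 2)  # largest singular value of A
    return spectral_norm * np.trace(B) - abs(np.trace(A @ B))

rng = np.random.default_rng(0)
n = 6
G = rng.standard_normal((n, n))
A = (G + G.T) / 2            # symmetric, possibly indefinite
D = rng.standard_normal((n, 2))
B = D @ D.T                  # PSD by construction, like Delta Delta^T
gap = trace_inequality_gap(A, B)
```

A nonnegative `gap` is what the inequality predicts; a single random draw is of course only a check, not a proof.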

To bound the first term in equation (26), we observe: 


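$$
\begin{aligned}
\|Q_U Q_U^\top \nabla f(X)\|_2 \cdot \mathrm{Dist}(U, U_r^*)^2 \qquad (27) \\
= \hat{\eta} \cdot 16 \left( M \|X\|_2 + \|Q_U Q_U^\top \nabla f(X)\|_2 \right) \cdot \|Q_U Q_U^\top \nabla f(X)\|_2 \cdot \mathrm{Dist}(U, U_r^*)^2 \\
= \hat{\eta} \Big[ 16 \underbrace{M \|X\|_2 \|Q_U Q_U^\top \nabla f(X)\|_2 \cdot \mathrm{Dist}(U, U_r^*)^2}_{=: A} + 16 \|Q_U Q_U^\top \nabla f(X)\|_2^2 \cdot \mathrm{Dist}(U, U_r^*)^2 \Big]
\end{aligned}
$$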


At this point, we desire to introduce strong convexity parameter m and condition number n in our bound. In 
particular, to bound term A , we observe that \\QuQu^f(X)\\ 2 < or WQuQu^ f(X)\\ 2 > . This 

results into bounding A as follows: 

M\\X\\ 2 \\QuQJjX f(X)\\ 2 • Dist([7, U*) 2 

< ma { Kii ' M| ^"" r111 ■ Dist(?7, Ujf, rj- 16 • 40kt(X)||Q £/ QJv/(X)||^ • Dist (U, C/*) 2 } 

< 16 ^- M II X 4 IA—(a-) . Dist(C ; ) jj*f 40kt(A)||Q C7 QOv/(A)||| • Dist([7, Ujf. 


Combining the above inequalities, we obtain: 


\QuQuVf(X)\\ 2 -DiST(U,u; 

(i) 


(28) 


< ■ Dist([7, U*) 2 + (40 kt(X) + 1) ■ 16 • rj\\QuQjjV f(X)\\l • Dist (U,U* r f 


(ii) 

— 40 
(Hi) 

— 40 

(iv) 

< 


^ ■ Dist(£7, U:f + (41kt(X*) + 1) • 16 • rj\\QuQlX }(X)\\l • (p') 2 <j r (X*) 
• Dist(V, U:f + 16 ■ 42 • rj ■ KT (X* r ) ■ \\Vf(X)U\\% • 


^ • Dist([7, U*) 2 + § ■ \\Vf(X)U\\ 


2 

F) 


(29) 


where ( i) follows from rj < 16M || y|| 9 > (**) is due to Lemma A.3 and bounding Dist (U,U*) < p'a r (U*) 
by the hypothesis of the lemma, (Hi) is due to a r (X*) < 1.1 cr r (X) by Lemma |A.3 and due to the facts 
a r (X)\\QuQjjXf(X)\\l < \\U T Xf(X)\\ F and (41 kt(X*) + 1) < 42kt(A'*). Finally, (iv) follows from substitut¬ 
ing p' and using Lemma |A.3| 
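Next, we bound the second term in equation (26):

$$
\begin{aligned}
\|Q_{U_r^*} Q_{U_r^*}^\top \nabla f(X)\|_2 \cdot \mathrm{Dist}(U, U_r^*)^2 &\overset{(i)}{\le} \|\nabla f(X) - \nabla f(X^*)\|_2 \cdot \mathrm{Dist}(U, U_r^*)^2 \qquad (30) \\
&\le \|\nabla f(X) - \nabla f(X^*)\|_F \cdot \mathrm{Dist}(U, U_r^*)^2 \\
&\overset{(ii)}{\le} M \left( \|X - X_r^*\|_F + \|X^* - X_r^*\|_F \right) \cdot \mathrm{Dist}(U, U_r^*)^2 \\
&\overset{(iii)}{\le} M (2 + \rho') \cdot \rho' \cdot \|U_r^*\|_2 \cdot \sigma_r(U_r^*) \cdot \mathrm{Dist}(U, U_r^*)^2 + M \|X^* - X_r^*\|_F \cdot \mathrm{Dist}(U, U_r^*)^2 \\
&\overset{(iv)}{\le} M (2 + \rho') \cdot \frac{\|U_r^*\|_2 \, \sigma_r(U_r^*)}{100 \kappa \tau(X_r^*)} \cdot \mathrm{Dist}(U, U_r^*)^2 + M \|X^* - X_r^*\|_F \cdot \mathrm{Dist}(U, U_r^*)^2 \\
&\le \frac{m \sigma_r(X^*)}{40} \, \mathrm{Dist}(U, U_r^*)^2 + M \|X^* - X_r^*\|_F \cdot \mathrm{Dist}(U, U_r^*)^2, \qquad (31)
\end{aligned}
$$

where (i) follows from $\nabla f(X^*) X^* = 0$, (ii) is due to smoothness of $f$ and (iii) follows from Lemma A.2. Finally,
(iv) follows from $\mathrm{Dist}(U, U_r^*) \le \rho' \sigma_r(U_r^*)$ and substituting $\rho' = \frac{1}{100 \kappa \tau(X_r^*)}$.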

Substituting (29), (31) in (26), we get: 

(v/(A'),AA t ) > - (§§ ||V/(X)[/|| 2 F 4 

This completes the proof. 


r(X*) 


Dist([7, U*) 2 + M || A* - A'*||f • Dist([7, U*) 2 ) 


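B.3 Proof of Lemma 6.1

Recall $U^+ = U - \hat{\eta} \nabla f(X) U$. First we rewrite the inner product as shown below.

$$
\begin{aligned}
\tfrac{1}{\hat{\eta}} \langle U - U^+, U - U_r^* R_U^* \rangle &= \langle \nabla f(X) U, U - U_r^* R_U^* \rangle \\
&= \langle \nabla f(X), X - U_r^* R_U^* U^\top \rangle \\
&= \tfrac{1}{2} \langle \nabla f(X), X - X_r^* \rangle + \langle \nabla f(X), \tfrac{1}{2}(X + X_r^*) - U_r^* R_U^* U^\top \rangle \\
&= \tfrac{1}{2} \langle \nabla f(X), X - X_r^* \rangle + \tfrac{1}{2} \langle \nabla f(X), \Delta \Delta^\top \rangle,
\end{aligned}
$$

which follows by adding and subtracting $\tfrac{1}{2} X_r^*$.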
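The decomposition above is purely algebraic and only needs the gradient to be symmetric, so it can be checked on random data. In the sketch below, `G` is a random symmetric stand-in for $\nabla f(X)$ and `V` stands in for $U_r^* R_U^*$. Both are assumptions made for illustration and are not tied to any particular $f$.

```python
import numpy as np

def inner(P, Q):
    """Frobenius (trace) inner product <P, Q>."""
    return float(np.sum(P * Q))

rng = np.random.default_rng(1)
n, r = 8, 3
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))   # stands in for U_r^* R_U^*
S = rng.standard_normal((n, n))
G = (S + S.T) / 2                 # symmetric stand-in for grad f(X)

X = U @ U.T
X_star = V @ V.T
Delta = U - V

lhs = inner(G @ U, U - V)
rhs = 0.5 * inner(G, X - X_star) + 0.5 * inner(G, Delta @ Delta.T)
```

The two sides agree up to floating-point error. The symmetry of `G` is what makes $\langle G, UV^\top \rangle = \langle G, VU^\top \rangle$, which the identity relies on.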


Let $\hat{\eta} = \frac{1}{16 \left( M \|X\|_2 + \|\nabla f(X) Q_U Q_U^\top\|_2 \right)}$. Using Lemmas 6.2 and 6.3, we have:

(Xf(x)u, u - u;rd 

> if • l|V/(X)J/|| F + f \\X - X* r f F - f ||X* - X*\\ 2 F 

- \ (I • l|V/(X)U||! + • Dist(U, u :) 2 + M||X* - X r *|| F • Dist (U, t/ r *) 2 ) 

= (jo - ' *?l|V/P0U|| F - f ||X* - X r *||* 

+ f (||X - X*||^, - • Dist(J7, U*) 2 - 2« • \\X* - X*|| F • Dist(U, U r *) 2 ) 


(32) 


(* 


L r ll F 


> f • llv/(x)u|| 2 F - f ||x* - x; 

+ f (||X - X*||^ - ■ Dist([/, t/ r *) 2 - pp ■ Dist(U, t/*) 2 ) 


> f • || V/(X)[/||| + M . CTr (X*) ■ DiST(t/, U *) 2 - f ||X* - X 


r llF 


where (i) follows from || X* — X*|| < ipp^ii ( pp y — ioo«fii — PcL ^ an d (m) follows from Lemma 
the result follows from rj > |r/ from Lemma 


A.5 


A.4 


Finally 


C Main lemmas for the smooth case 


In this section, we present the main lemmas, used in the proof of Theorem 4.1 in Section 6. First, we present a
lemma bounding the error term $\langle \nabla f(X), \Delta \Delta^\top \rangle$, that appears in eq. (18).

Lemma C.1. Let $f$ be $M$-smooth and $X = UU^\top$; also, define $\Delta := U - U_r^* R_U^*$. Then, for $\mathrm{Dist}(U, U_r^*) \le$
pa r (C/*) and p = ’ ^ ie following bound holds true: 

(V/(X), AA t ) < ^\\Vf(X)U\\ 2 ■ Dist({7, U*). 

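Proof. By the Von Neumann's trace inequality for PSD matrices, we know that $\mathrm{Tr}(AB) \le \mathrm{Tr}(A) \cdot \|B\|_2$, for $A$
PSD matrix. In our context, we then have:

$$
\begin{aligned}
\langle \nabla f(X), \Delta \Delta^\top \rangle &\le \|\nabla f(X) Q_\Delta Q_\Delta^\top\|_2 \cdot \mathrm{Tr}(\Delta \Delta^\top) \\
&\overset{(i)}{\le} \left( \|\nabla f(X) Q_U Q_U^\top\|_2 + \|\nabla f(X) Q_{U_r^*} Q_{U_r^*}^\top\|_2 \right) \cdot \mathrm{Dist}(U, U_r^*)^2, \qquad (33)
\end{aligned}
$$

where (i) is because $\Delta$ can be decomposed into the column span of $U$ and $U_r^*$, and the orthogonality of the
rotational matrix $R_U^*$. In sequence, we further bound the term $\|\nabla f(X) Q_{U_r^*} Q_{U_r^*}^\top\|_2$ as follows: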

\\xf(x)u;\\ 2 < \\xf(x)u\\ 2 + ||v/(x)A|| 2 

(a) 

< \\Xf(X)U\\ 2 + ||V/(X)QaQI|| 2 ||A|| 2 

(Hi) / \ 

< \\Vf(X)U\\ 2 + [\\Xf(X)Q u Qj J \\ 2 + \\Xf(X)Qu*Qu* || 2 J ||A|| 2 

< \\Xf(X)U\\ 2 + (\\Xf(X)Q u QZh + \\Xf(X)Q Uf Q^\\ 2 ) ^a r (Uf) 

< \\Xf(X)U\\ 2 + • ^\\Xf(X)U\\ 2 + I i 0 ||V/(A)C/*|| 2 

V- 1 - 100,/ 

< m\\vf(x)u\\ 2 + 1 ^\\vf(x)u:\\ 2 . 

where (i) is due to triangle inequality on UfRfj = U — A, (ii) is due to generalized Cauchy-Schwarz inequality; 
we denote as QaQa the projection matrix on the column span of A matrix, (in) is due to triangle inequality 
and the fact that the column span of A can be decomposed into the column span of U and I/*, by construction 
of A, (iv) is due to 

||A|| 2 < Dist (17, Uf) < ■ °r(Uf) < ^ • a r (Uf). 

Finally, (v) is due to the facts: 

II V/(X)t/*|| 2 = \\S7 f (X)Qu*Qjj+Uf\\ 2 > \\Xf(X)Qu*Qjj* || 2 • a r (Uf), 

and 


\\Vf(X)U\\ 2 = \\Xf(X)Q u Q T u U\\ 2 > \\Xf(X)QuQl\\ 2 ■ a r (U) 

> \\Xf(X)QuQl\\ 2 ■ (1 - I i g ) • a r (Uf), 

by Lemma A.3 Thus: 

\\Xf(X)Q u ,Q T Uf \\ 2 < -^\\Vf(X)Uf\\ 2 

< Pf) wlivjw /|| 2 

< PpW\\^f( x )QuQuh, 

and, combining with ( |33| ), we get 

<V/(X), AA t ) < (^§ + 1) . ■ \\Vf(X)QuQl\l 2 ■ DiST(f 7, Uff 

< i||V/(A)C/|| 2 -DiST(C/,17*). 

The last inequality follows from Dist([/, Uf) < ^ a[{x*] ’ a r(Uf). This completes the proof. 


(34) 


□ 


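The following lemma lower bounds the term $\langle \nabla f(X), \Delta \Delta^\top \rangle$; this result is used later in the proof of Lemma C.4.

Lemma C.2. Let $X = UU^\top$ and define $\Delta := U - U_r^* R_U^*$. Let $f(X^+) \ge f(X_r^*)$, where $X_r^*$ is the optimum
of the problem (1). Then, for $\mathrm{Dist}(U, U_r^*) \le \rho \sigma_r(U_r^*)$, where $\rho = \frac{1}{100} \frac{\sigma_r(X^*)}{\sigma_1(X^*)}$, and $f$ an $M$-smooth convex
function, the following lower bound holds:

$$
\langle \nabla f(X), \Delta \Delta^\top \rangle \ge - \frac{\sqrt{2}}{\sqrt{2} - \frac{1}{100}} \cdot \frac{1}{100} \cdot \left| \langle \nabla f(X), X - X_r^* \rangle \right|.
$$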

















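Proof. Let the QR factorization of the matrix $[U \;\; U_r^* R_U^*]_{n \times 2r}$ be $Q \cdot R$, where $Q$ is an $n \times 2r$ orthonormal matrix
and $R$ is a $2r \times 2r$ invertible matrix (since $[U \;\; U_r^* R_U^*]$ is assumed to be rank-$2r$). Further, let $[U \;\; U_r^* R_U^*]^\dagger_{2r \times n} = R^{-1} Q^\top$,
where $C^\dagger$ denotes the pseudo-inverse of matrix $C$. It is obvious that $[U \;\; U_r^* R_U^*]^\top \cdot ([U \;\; U_r^* R_U^*]^\dagger)^\top = I_{2r \times 2r}$.

Given the above, let us re-define some quantities w.r.t. $[U \;\; U_r^* R_U^*]$, as follows:

$$
\Delta = U - U_r^* R_U^* = [U \;\; U_r^* R_U^*]_{n \times 2r} \begin{bmatrix} I_{r \times r} \\ -I_{r \times r} \end{bmatrix}_{2r \times r}.
$$

Moreover, it is straightforward to justify that:

$$
X - X_r^* = [U \;\; U_r^* R_U^*]_{n \times 2r} \begin{bmatrix} I_{r \times r} & 0_{r \times r} \\ 0_{r \times r} & -I_{r \times r} \end{bmatrix}_{2r \times 2r} [U \;\; U_r^* R_U^*]^\top_{2r \times n}.
$$

Then, from the above, the two quantities $X - X_r^*$ and $\Delta$ are connected as follows:

$$
(X - X_r^*) \cdot ([U \;\; U_r^* R_U^*]^\dagger)^\top \begin{bmatrix} I \\ I \end{bmatrix} = [U \;\; U_r^* R_U^*] \begin{bmatrix} I & 0 \\ 0 & -I \end{bmatrix} [U \;\; U_r^* R_U^*]^\top \cdot ([U \;\; U_r^* R_U^*]^\dagger)^\top \begin{bmatrix} I \\ I \end{bmatrix},
$$

which is equal to $\Delta$. Then, the following sequence of (in)equalities holds true:

$$
\begin{aligned}
\langle \nabla f(X), \Delta \Delta^\top \rangle &\overset{(i)}{=} \left\langle \nabla f(X), (X - X_r^*) \cdot ([U \;\; U_r^* R_U^*]^\dagger)^\top \begin{bmatrix} I \\ I \end{bmatrix} \Delta^\top \right\rangle \\
&\overset{(ii)}{=} \mathrm{Tr}\Big( \underbrace{\nabla f(X) \cdot (X - X_r^*)}_{=: A} \cdot \underbrace{([U \;\; U_r^* R_U^*]^\dagger)^\top \begin{bmatrix} I \\ I \end{bmatrix} \Delta^\top}_{=: B} \Big) \\
&\overset{(iii)}{\ge} - \left| \mathrm{Tr}(\nabla f(X) \cdot (X - X_r^*)) \right| \cdot \left\| ([U \;\; U_r^* R_U^*]^\dagger)^\top \begin{bmatrix} I \\ I \end{bmatrix} \Delta^\top \right\|_2 \\
&\overset{(iv)}{\ge} - \left| \langle \nabla f(X), X - X_r^* \rangle \right| \cdot \left\| [U \;\; U_r^* R_U^*]^\dagger \right\|_2 \cdot \left\| \begin{bmatrix} I \\ I \end{bmatrix} \right\|_2 \cdot \|\Delta\|_2 \\
&\overset{(v)}{\ge} - \sqrt{2} \cdot \left| \langle \nabla f(X), X - X_r^* \rangle \right| \cdot \left\| [U \;\; U_r^* R_U^*]^\dagger \right\|_2 \cdot \tfrac{1}{100} \sigma_r(U_r^*) \\
&\overset{(vi)}{\ge} - \sqrt{2} \cdot \left| \langle \nabla f(X), X - X_r^* \rangle \right| \cdot \frac{1}{\left(\sqrt{2} - \frac{1}{100}\right) \sigma_r(U_r^*)} \cdot \tfrac{1}{100} \sigma_r(U_r^*) \\
&= - \frac{\sqrt{2}}{\sqrt{2} - \frac{1}{100}} \cdot \frac{1}{100} \cdot \left| \langle \nabla f(X), X - X_r^* \rangle \right|, \qquad (35)
\end{aligned}
$$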


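where (i) follows by substituting $\Delta$, according to the discussion above, (ii) follows from symmetry of $\nabla f(X)$,
(iii) follows from the Von Neumann trace inequality $\mathrm{Tr}(AB) \le \mathrm{Tr}(A) \|B\|_2$, for a PSD matrix $A$; next, we show
that $y^\top A y \ge 0$, $\forall y$ and $A := \nabla f(X) \cdot (X - X^*)$, (iv) is due to successive application of the Cauchy-Schwarz
inequality, (v) is due to $\left\| [I \;\; I]^\top \right\|_2 = \sqrt{2}$ and $\|\Delta\|_2 \le \mathrm{Dist}(U, U_r^*) \le \rho \cdot \sigma_r(U_r^*) \le \tfrac{1}{100} \cdot \sigma_r(U_r^*)$, and (vi) follows
from the following fact:

$$
\begin{aligned}
\frac{1}{\left\| [U \;\; U_r^* R_U^*]^\dagger \right\|_2} &= \sigma_r([U \;\; U_r^* R_U^*]) \\
&= \sigma_r\left( [U \;\; U_r^* R_U^*] - [U_r^* R_U^* \;\; U_r^* R_U^*] + [U_r^* R_U^* \;\; U_r^* R_U^*] \right) \\
&= \sigma_r\left( [U - U_r^* R_U^* \;\; 0] + [U_r^* R_U^* \;\; U_r^* R_U^*] \right) \\
&\overset{(i)}{\ge} \sigma_r\left( [U_r^* R_U^* \;\; U_r^* R_U^*] \right) - \|U - U_r^* R_U^*\|_2 \\
&\overset{(ii)}{=} \sqrt{2} \cdot \sigma_r(U_r^*) - \|U - U_r^* R_U^*\|_2 \\
&\overset{(iii)}{\ge} \left( \sqrt{2} - \tfrac{1}{100} \right) \sigma_r(U_r^*),
\end{aligned}
$$

where (i) follows from a variant of Weyl's inequality, (ii) is due to $\sigma_r([U_r^* R_U^* \;\; U_r^* R_U^*]) = \sqrt{2} \cdot \sigma_r(U_r^*)$, and (iii)
follows from the assumption that $\|U - U_r^* R_U^*\|_2 \le \mathrm{Dist}(U, U_r^*) \le \tfrac{1}{100} \sigma_r(U_r^*)$. The above lead to the inequality:

$$
\left\| [U \;\; U_r^* R_U^*]^\dagger \right\|_2 \le \frac{1}{\left( \sqrt{2} - \frac{1}{100} \right) \sigma_r(U_r^*)}.
$$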
In the above inequalities (35), we used the fact that symmetric version of $A$ is a PSD matrix, where
$A := \nabla f(X) \Delta (U + U_r^* R_U^*)^\top = \nabla f(X) \cdot (X - X^*)$ is a PSD matrix, i.e., given a vector $y$, $y^\top \nabla f(X) \cdot (X - X^*) y \ge 0$.
To show this, let $g(t) = f(X + t y y^\top)$ be a function from $\mathbb{R} \to \mathbb{R}$. Hence, $\nabla g(t) = \langle \nabla f(X + t y y^\top), y y^\top \rangle$. Now,
consider $g$ restricted to the level set $\{t : f(X + t y y^\top) \le f(X)\}$. Note that, since $f$ is convex, this set is convex
and further $X$ belongs to this set from the hypothesis of the lemma. Also $f(X^*) \le f(X + t y y^\top)$, for $t$ in this
set from the optimality of $X^*$. Let $t^*$ be the minimizer of $g(t)$ over this set. Then, by convexity of $g$,

$$
\langle \nabla f(X), y y^\top \rangle \cdot (-t^*) = \nabla g(0) \cdot (0 - t^*) \ge g(0) - g(t^*) \ge 0.
$$

Further, since $g(t^*) = f(X + t^* y y^\top) \ge f(X^*)$, $X + t^* y y^\top - X^*$ is orthogonal to $y$. Hence, $(X + t^* y y^\top - X^*) y = 0$.
Combining this with the above inequality gives $\langle \nabla f(X), (X - X_r^*) y y^\top \rangle \ge 0$. This completes the proof. □


We next present a lemma for lower bounding the term $\langle \nabla f(X), X - X_r^* \rangle$. This result is used in the following
Lemma C.4 where we bound the term $\langle \nabla f(X) U, U - U_r^* R_U^* \rangle$.


Lemma C.3. Let $f$ be an $M$-smooth convex function with optimum point $X_r^*$. Then, under the assumption that
$f(X^+) \ge f(X_r^*)$, the following holds:

$$
\langle \nabla f(X), X - X_r^* \rangle \ge \tfrac{18 \hat{\eta}}{10} \|\nabla f(X) U\|_F^2.
$$


Proof. The proof follows much like the proof of the Lemma for the strongly convex case (Lemma 6.2), except for the
arguments used to bound equation (22). For completeness, we here highlight the differences; in particular, we
again have by smoothness of $f$:

$$
f(X) \ge f(X^+) - \langle \nabla f(X), X^+ - X \rangle - \tfrac{M}{2} \|X^+ - X\|_F^2,
$$

where we consider the same notation with Lemma 6.2. By the assumptions of the Lemma, we have $f(X^+) \ge f(X_r^*)$
and, thus, the above translates into:

$$
f(X) \ge f(X_r^*) - \langle \nabla f(X), X^+ - X \rangle - \tfrac{M}{2} \|X^+ - X\|_F^2,
$$

hence eliminating the need for equation (23). Combining the above and assuming just smoothness (i.e., the
restricted strong convexity parameter is $m = 0$), we obtain a simpler version of eq. (25):

$$
\langle \nabla f(X), X - X_r^* \rangle \ge \langle \nabla f(X), X - X^+ \rangle - \tfrac{M}{2} \|X^+ - X\|_F^2. \qquad (36)
$$

Then, the result easily follows by the same steps in Lemma 6.2. □


Next, we state an important result, relating the gradient step in the factored space $U^+ - U$ to the direction
to the optimum $U - U_r^*$. The result borrows the outcome of the preceding lemmas.

Lemma C.4. Let $X = UU^\top$ and define $\Delta := U - U_r^* R_U^*$. Assume $f(X^+) \ge f(X_r^*)$ and $\mathrm{Dist}(U, U_r^*) \le \rho \sigma_r(U_r^*)$,
where $\rho = \frac{1}{100} \frac{\sigma_r(X^*)}{\sigma_1(X^*)}$. For $f$ being an $M$-smooth convex function, the following descent condition holds for the
$U$-space:

$$
\langle \nabla f(X) U, U - U_r^* R_U^* \rangle \ge \tfrac{\eta}{2} \cdot \|\nabla f(X) U\|_F^2.
$$


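Proof. Expanding the term $\langle \nabla f(X) U, U - U_r^* R_U^* \rangle$, we obtain the equivalent characterization:

$$
\begin{aligned}
\langle \nabla f(X) U, U - U_r^* R_U^* \rangle &= \langle \nabla f(X), X - U_r^* R_U^* U^\top \rangle \\
&= \tfrac{1}{2} \langle \nabla f(X), X - X_r^* \rangle + \langle \nabla f(X), \tfrac{1}{2}(X + X_r^*) - U_r^* R_U^* U^\top \rangle \\
&= \tfrac{1}{2} \langle \nabla f(X), X - X_r^* \rangle + \tfrac{1}{2} \langle \nabla f(X), \Delta \Delta^\top \rangle, \qquad (37)
\end{aligned}
$$

which follows by the definition of $X$ and adding and subtracting the $\tfrac{1}{2} X_r^*$ term. By Lemma C.3, we can bound the
first term on the right hand side as:

$$
\tfrac{1}{2} \langle \nabla f(X), X - X_r^* \rangle \ge \tfrac{18 \hat{\eta}}{20} \cdot \|\nabla f(X) U\|_F^2. \qquad (38)
$$

Observe that $\langle \nabla f(X), X - X_r^* \rangle \ge 0$. By Lemma C.2, we can lower bound the last term on the right hand side
of (37) as:

$$
\tfrac{1}{2} \langle \nabla f(X), \Delta \Delta^\top \rangle \ge - \tfrac{1}{2} \cdot \frac{\sqrt{2}}{\sqrt{2} - \frac{1}{100}} \cdot \frac{1}{100} \left| \langle \nabla f(X), X - X_r^* \rangle \right| = - \frac{\sqrt{2}}{\sqrt{2} - \frac{1}{100}} \cdot \frac{1}{200} \langle \nabla f(X), X - X_r^* \rangle. \qquad (39)
$$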
Combining (38) and (39) in (37), we get:


<v/(W u - u;r* v ) > i <v/(x ),x - x*) - ^ <v/(x),x - x* r ) 

V1 100 

> (i “ • tso) • 5 <WP0, A - A ") 

> • W • l|V/(X)C/||^ 

> ir i • |HIV/(X)E/||^ 

> ig • l|V/(X)C/||| > i ■ \\Vf(X)U\\%, 


where (i) follows from rj > %rj in Lemma A.5 This completes the proof. 


□ 


We conclude this section with a lemma that proves that the distance $\mathrm{Dist}(U, U_r^*)$ is non-increasing per
iteration of Fgd. This lemma is used in the proof of sublinear convergence of Fgd (Theorem 4.1), in Section 6.


Lemma C.5. Let $X = UU^\top$ and $X^+ = U^+ (U^+)^\top$ be the current and next estimate of Fgd. Assume $f$ is an
$M$-smooth convex function such that $f(X^+) \ge f(X_r^*)$. Moreover, define $\Delta := U - U_r^* R_U^*$ and $\mathrm{Dist}(U, U_r^*) \le \rho \sigma_r(U_r^*)$,
where $\rho = \frac{1}{100} \frac{\sigma_r(X^*)}{\sigma_1(X^*)}$. Then, the following inequality holds:

$$
\mathrm{Dist}(U^+, U_r^*) \le \mathrm{Dist}(U, U_r^*).
$$

This further implies $\mathrm{Dist}(U, U_r^*) \le \mathrm{Dist}(U^0, U_r^*)$ for any estimate $U$ of Fgd.

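Proof. Let $R_U^* = \arg\min_{R \in \mathcal{O}} \|U - U_r^* R\|_F$. Expanding $\mathrm{Dist}(U^+, U_r^*)^2$, we obtain:

$$
\begin{aligned}
\mathrm{Dist}(U^+, U_r^*)^2 &= \min_{R \in \mathcal{O}} \|U^+ - U_r^* R\|_F^2 \qquad (40) \\
&\le \|U^+ - U_r^* R_U^*\|_F^2 \\
&= \|U^+ - U + U - U_r^* R_U^*\|_F^2 \\
&= \|U^+ - U\|_F^2 + \|U - U_r^* R_U^*\|_F^2 - 2 \langle U^+ - U, U_r^* R_U^* - U \rangle \\
&= \eta^2 \|\nabla f(X) U\|_F^2 + \mathrm{Dist}(U, U_r^*)^2 - 2 \eta \langle \nabla f(X) U, U - U_r^* R_U^* \rangle \\
&\le \mathrm{Dist}(U, U_r^*)^2, \qquad (41)
\end{aligned}
$$

where the last inequality is due to Lemma C.4. □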
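The distance $\mathrm{Dist}(U, U_r^*) = \min_{R \in \mathcal{O}} \|U - U_r^* R\|_F$ used throughout is an orthogonal Procrustes problem, so it can be evaluated in closed form from one small SVD. The sketch below is one way to compute it; the function names `procrustes_rotation` and `dist` are hypothetical helpers introduced here, not notation from the paper.

```python
import numpy as np

def procrustes_rotation(U, U_star):
    """Orthogonal R minimizing ||U - U_star @ R||_F (orthogonal Procrustes)."""
    # The SVD of the r x r cross-product U_star^T U yields the optimal rotation.
    P, _, Qt = np.linalg.svd(U_star.T @ U)
    return P @ Qt

def dist(U, U_star):
    """Dist(U, U_star): minimum over orthogonal R of ||U - U_star R||_F."""
    R = procrustes_rotation(U, U_star)
    return np.linalg.norm(U - U_star @ R, "fro")

rng = np.random.default_rng(2)
n, r = 10, 3
U_star = rng.standard_normal((n, r))
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # random r x r orthogonal matrix
rotated = U_star @ Q                                # same X = U U^T as U_star
perturbed = rotated + 1e-3 * rng.standard_normal((n, r))
```

Here `dist(rotated, U_star)` is zero up to rounding, reflecting that $U$ and $UR$ represent the same $X = UU^\top$, while `dist(perturbed, U_star)` is at most the size of the perturbation.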


D Initialization proofs 

D.1 Proof of Lemma 5.1

The proof borrows results from standard projected gradient descent. In particular, we know from Theorem 3.6
in [2] that, for consecutive estimates $X^+, X$ and optimal point $X^*$, projected gradient descent satisfies:

$$
\|X^+ - X^*\|_F^2 \le \left( 1 - \tfrac{1}{\kappa} \right) \cdot \|X - X^*\|_F^2.
$$

By taking square root of the above inequality, we further have:

$$
\|X^+ - X^*\|_F \le \sqrt{1 - \tfrac{1}{\kappa}} \cdot \|X - X^*\|_F \le \left( 1 - \tfrac{1}{2\kappa} \right) \cdot \|X - X^*\|_F, \qquad (42)
$$

since $\sqrt{1 - \tfrac{1}{\kappa}} \le 1 - \tfrac{1}{2\kappa}$ for all values of $\kappa \ge 1$.
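For completeness, the elementary inequality $\sqrt{1 - 1/\kappa} \le 1 - 1/(2\kappa)$ can be verified by squaring the right-hand side, which is nonnegative for $\kappa \ge 1$:

$$
\left( 1 - \frac{1}{2\kappa} \right)^2 = 1 - \frac{1}{\kappa} + \frac{1}{4\kappa^2} \ge 1 - \frac{1}{\kappa}.
$$

Taking square roots on both sides gives the claim. As a quick numerical instance, $\kappa = 2$ gives $\sqrt{0.5} \approx 0.707 \le 0.75$.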

Given the above, the following (in)equalities hold true:

$$
\begin{aligned}
\|X - X^+\|_F &= \|X - X^* + X^* - X^+\|_F \\
&\overset{(i)}{\ge} \|X - X^*\|_F - \|X^+ - X^*\|_F \\
&\overset{(ii)}{\ge} \tfrac{1}{2\kappa} \|X - X^*\|_F \;\Longrightarrow\; \|X - X^*\|_F \le 2\kappa \cdot \|X - X^+\|_F,
\end{aligned}
$$

where (i) is due to the lower bound on triangle inequality and (ii) is due to (42). Under the assumptions of the
lemma, if ||X — X + || F < ) °v(A'), the above inequality translates into: 

W X ~ X *Wf < 

By construction, both $X$ and $X^*$ are PSD matrices; moreover, $X$ can be a matrix with $\mathrm{rank}(X) > r$. Hence,

$$
\|X_r - X^*\|_F \le \sqrt{2r} \, \|X_r - X^*\|_2 \le 2 \sqrt{2r} \, \|X - X^*\|_2 \le 2 \sqrt{2r} \, \|X - X^*\|_F,
$$

using Weyl’s inequalities. Thus, \\X r — A'*|| F < T (x ) a r(X). Define X r = U r Uj and X * = U* (U*) T . Further, 
by Lemma 5.4 of in!, we have: 


IX. - X* 


If > 


^2 (y/2- l)(T r {U*) • DlST(U r , U*). 


The above lead to: 


2 (72-1 )<r r {Ut) ■ Dist (U r ,U*) < ^a r (X). 


Recall that a r (X ) = a 2 (U r ); then, by Lemma 


A.3 


Combining all the above, we conclude that there is constant c' > 0 such that: 

Dist (U r ,U*) < ^a r (U'*). 


there is constant c" > 0 such that 


D.2 Proof of Theorem 5.2

Recall $X^0 = \frac{1}{\|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F} \mathcal{P}_+(-\nabla f(0))$. Here, we remind that $\mathcal{P}_+(\cdot)$ is the projection operator onto the PSD cone
and $\mathcal{P}_-(\cdot)$ is the projection operator onto the negative semi-definite cone.

To bound $\|X^0 - X^*\|_F$, we will bound each individual term in its squared expansion

$$
\|X^0 - X^*\|_F^2 = \|X^0\|_F^2 + \|X^*\|_F^2 - 2 \langle X^0, X^* \rangle.
$$
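The initialization only needs two gradient evaluations and one eigendecomposition. The following NumPy sketch shows one way this could look: `project_psd` clips negative eigenvalues and `initial_point` forms $X^0$ from a user-supplied gradient oracle. Both function names, and the quadratic test objective $f(X) = \tfrac{1}{2}\|X - X^*\|_F^2$ (for which $M = m = 1$), are assumptions made for illustration.

```python
import numpy as np

def project_psd(S):
    """Projection of a symmetric matrix onto the PSD cone (clip negative eigenvalues)."""
    S = (S + S.T) / 2
    w, V = np.linalg.eigh(S)
    return (V * np.clip(w, 0.0, None)) @ V.T

def initial_point(grad_f, n):
    """X^0 = P_+(-grad f(0)) / ||grad f(0) - grad f(e_1 e_1^T)||_F."""
    zero = np.zeros((n, n))
    e1e1 = np.zeros((n, n))
    e1e1[0, 0] = 1.0
    g0 = grad_f(zero)
    scale = np.linalg.norm(g0 - grad_f(e1e1), "fro")
    return project_psd(-g0) / scale

# Hypothetical test objective: f(X) = 0.5 * ||X - X_star||_F^2, so grad f(X) = X - X_star.
rng = np.random.default_rng(3)
n, r = 6, 2
U_star = rng.standard_normal((n, r))
X_star = U_star @ U_star.T
X0 = initial_point(lambda X: X - X_star, n)
```

For this particular objective the scaling factor equals one and $-\nabla f(0) = X^*$ is already PSD, so the sketch returns $X^0 = X^*$. For a general $f$ the output is only a starting point in the sense of Theorem 5.2.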

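From the smoothness of $f$, we get the following:

$$
M \|X^*\|_F \ge \|\nabla f(0) - \nabla f(X^*)\|_F \overset{(i)}{\ge} \|\mathcal{P}_-(\nabla f(0)) - \mathcal{P}_-(\nabla f(X^*))\|_F \overset{(ii)}{=} \|\mathcal{P}_-(\nabla f(0))\|_F,
$$

where (i) follows from non-expansiveness of projection operator and (ii) follows from the fact that $\nabla f(X^*)$
is PSD and hence $\mathcal{P}_-(\nabla f(X^*)) = 0$. Finally, observe that $\mathcal{P}_-(\nabla f(0)) = -\mathcal{P}_+(-\nabla f(0))$. The above combined
imply:

$$
\|\mathcal{P}_+(-\nabla f(0))\|_F \le M \|X^*\|_F \;\Longrightarrow\; \|X^0\|_F \le \frac{M}{\|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F} \|X^*\|_F \le \kappa \|X^*\|_F,
$$

where we used the fact that $m \le \|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F \le M$ and $\kappa = M/m$. Hence $\|X^0\|_F^2 \le \kappa^2 \|X^*\|_F^2$.
Using the strong convexity of $f$ around $X^*$, we observe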


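$$
f(0) \ge f(X^*) + \langle \nabla f(X^*), 0 - X^* \rangle + \tfrac{m}{2} \|X^*\|_F^2 \ge f(X^*) + \tfrac{m}{2} \|X^*\|_F^2,
$$

where the last inequality follows from first order optimality of $X^*$, $\langle \nabla f(X^*), 0 - X^* \rangle \ge 0$ and $0$ is a feasible
point for problem (1). Similarly, using strong convexity of $f$ around $0$, we have

$$
f(X^*) \ge f(0) + \langle \nabla f(0), X^* \rangle + \tfrac{m}{2} \|X^*\|_F^2.
$$

Combining the above two inequalities we get, $\langle -\nabla f(0), X^* \rangle \ge m \|X^*\|_F^2$. Moreover:

$$
\begin{aligned}
\langle -\nabla f(0), X^* \rangle &= \langle \mathcal{P}_+(-\nabla f(0)) + \mathcal{P}_-(-\nabla f(0)), X^* \rangle \\
&= \langle \mathcal{P}_+(-\nabla f(0)), X^* \rangle + \underbrace{\langle \mathcal{P}_-(-\nabla f(0)), X^* \rangle}_{\le 0}
\end{aligned}
$$

since $X^*$ is PSD. Thus, $\langle \mathcal{P}_+(-\nabla f(0)), X^* \rangle \ge \langle -\nabla f(0), X^* \rangle$ and

$$
\langle X^0, X^* \rangle \ge \frac{m}{\|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F} \|X^*\|_F^2 \ge \frac{1}{\kappa} \|X^*\|_F^2, \qquad (43)
$$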
where we used the fact that $m \le \|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F \le M$. Given the above inequalities, we can now prove
the following:

$$
\begin{aligned}
\|X^0 - X^*\|_F^2 &= \|X^0\|_F^2 + \|X^*\|_F^2 - 2 \langle X^0, X^* \rangle \\
&\le \kappa^2 \|X^*\|_F^2 + \|X^*\|_F^2 - \tfrac{2}{\kappa} \|X^*\|_F^2 \\
&= \left( \kappa^2 - \tfrac{2}{\kappa} + 1 \right) \|X^*\|_F^2.
\end{aligned}
$$

Now we know that $\|X^0 - X^*\|_F \le \sqrt{\kappa^2 - 2/\kappa + 1} \, \|X^*\|_F$. Now, by triangle inequality, $\|X^0 - X_r^*\|_F \le
\sqrt{\kappa^2 - 2/\kappa + 1} \, \|X^*\|_F + \|X^* - X_r^*\|_F$. By $\|\cdot\|_2 \le \|\cdot\|_F$ and Weyl's inequality for perturbation of singular values
(Theorem 3.3.16 [37]) we get,

$$
\|X_r^0 - X_r^*\|_2 \le 2 \sqrt{\kappa^2 - 2/\kappa + 1} \, \|X^*\|_F + 2 \|X^* - X_r^*\|_F.
$$

By the assumptions of the theorem, we have $\|X^* - X_r^*\|_F \le \tilde{\rho} \|X_r^*\|_2$. Therefore,

$$
\|X_r^0 - X_r^*\|_F \le 2 \sqrt{2r} \left( \sqrt{\kappa^2 - 2/\kappa + 1} \, \|X^*\|_F + \tilde{\rho} \|X_r^*\|_2 \right).
$$

Now again using triangle inequality and substituting we get $\|X^*\|_F \le \mathrm{srank}^{1/2} \|X_r^*\|_2 + \tilde{\rho} \|X_r^*\|_2$. Finally,
combining this with Lemma A.4 gives the result.


E Dependence on condition number in linear convergence rate 

It is known that the convergence rate of classic gradient descent schemes depends only on the condition number
$\kappa = \frac{M}{m}$ of the function $f$. However, in the case of Fgd, we notice that the convergence rate also depends on
the condition number $\tau(X^*) = \frac{\sigma_1(X^*)}{\sigma_r(X^*)}$, as well as $\|\nabla f(X^*)\|_2$.

To elaborate more on this dependence, let us recall the update rule of Fgd, as presented in Section 3.
In particular, one can observe that the gradient direction has an extra factor $U$, multiplying $\nabla f(UU^\top)$, as
compared to the standard gradient descent on $X$. To reveal how this extra factor affects the condition
number of the Hessian of $f$, we consider the special case of separable functions; see the definition of separable
functions in the next lemma. Next, we show that the condition number of the Hessian, for this special case, has
indeed a dependence on both $\tau(X^*)$ and $\|\nabla f(X^*)\|_2$, a scaling similar to the one appearing in the convergence
rate $\alpha$ of Fgd.

Lemma E.1 (Dependence of Hessian on $\tau(X^*)$ and $\|\nabla f(X^*)\|_2$). Let $f$ be a smooth, twice differentiable
function over the PSD cone. Further, assume $f$ is a separable function over the matrix entries, such that
$f(X) = \sum_{(i,j)} \varphi_{ij}(X_{ij})$, where $(i, j) \in [n] \times [n]$, and let $\varphi_{ij}$'s be $M$-smooth and $m$-strongly convex functions,
$\forall i, j$. Finally, let $X^* = U^* (U^*)^\top$ be rank-$r$ and let $\nabla^2_{U^* R} f(X^*)$ denote the Hessian of $f$ w.r.t. the $U$ factor and
up to rotations, for some rotation matrix $R \in \mathcal{O}$. Then,

$$
\sigma_1\left( \nabla^2_{U^* R} f(X^*) \right) \le C \cdot \left( M \|X^*\|_2 + \|\nabla f(X^*)\|_2 \right),
$$

for constant $C$. Further, for any unit vector $y \in \mathbb{R}^{nr \times 1}$ such that columns of $\mathrm{mat}(y) \in \mathbb{R}^{n \times r}$ are orthogonal to
$U^*$, i.e., $\mathrm{mat}(y)^\top U^* = 0$, we further have:

$$
y^\top \nabla^2_{U^* R} f(X^*) \, y \ge c \cdot m \, \sigma_r(X^*),
$$

for some constant $c$.

Proof. By the definition of gradient, we know that $\nabla_U f(UU^\top) = (\nabla f(UU^\top) + \nabla f(UU^\top)^\top) U$; for simplicity, we
assume $\nabla f(UU^\top)$ to be symmetric, since $X$ is symmetric, with $\nabla f(UU^\top)_{ij} = \varphi'_{ij}(X_{ij})$ and $\varphi'_{ij}(X_{ij}) = \varphi'_{ji}(X_{ji})$.
By the definition of Hessian, the entries of $\nabla^2_U f(UU^\top)$ are given by:


(yll{uu T )) w 


eu. 


kl 


E 

p =i 


ip 


(Xi P )Upj 


n 


E 


WipjXip) 

dU k i 

^Ti 


Upj + El Pip(Xip) 

P= 1 

X V 

■=T 2 


dUpj 

dU kl ■ 


In particular, for Tf we observe the following cases: 


f ip'l k {X ik )UuU k j if * A A 

\ Ep Tip(Xip)UpiU P j + <p't(Xii)UuU k j if i = k. 
while, for $T_2$ we further have:


r f 0 if j A l, 

2 \ <p' ik (X ik ) if j = l. 

Consider now the case where gradient and Hessian information is calculated at the optimal point $X^*$. Based
on the above, the Hessian of $f$ w.r.t. $U^*$ turns out to be a sum of three PSD $nr \times nr$ matrices, as follows:

$$
\nabla^2_{U^*} f(X^*) = A + B + C,
$$

where

(i) $A = (\widetilde{U}^*)^\top G \widetilde{U}^*$, where $G$ is an $n^2 \times n^2$ diagonal matrix with diagonal elements $\varphi''_{ij}(X^*_{ij})$ and $\widetilde{U}^*$ is an $n^2 \times nr$
matrix with $U^*$ repeated $n$ times on the diagonal. It is easy to see that

$$
\|A\|_2 \le \|\varphi''_{ij}\|_\infty \, \sigma_1(U^*)^2 = M \|X^*\|_2.
$$

Similarly, we have $\sigma_{nr}(A) \ge \min \varphi''_{ij} \cdot \sigma_{\min}(U^*)^2 = m \, \sigma_{\min}(X^*)$.

(ii) $B$ is an $nr \times nr$ matrix, with $B_{ij,kl} = \varphi''_{ik}(X_{ik}) U^*_{il} U^*_{kj}$. Again, it is easy to verify that $\|B\|_2 \le M \|X^*\|_2$.
Now for $y$ perpendicular to $U^*$, notice that $y^\top B y = 0$, since the columns of $B$ are concatenation of scaled
columns of $U^*$.

(iii) $C$ is an $nr \times nr$ diagonal block-matrix, with $n \times n$ blocks $\nabla f(X^*)$ repeated $r$ times. It is again easy to
see that $\|C\|_2 \le \|\nabla f(X^*)\|_2$, since $C$ is a block diagonal matrix. Moreover, by KKT optimality condition
$\nabla f(X^*) X^* = 0$, $\mathrm{rank}(\nabla f(X^*)) \le n - r$ and thus, $\sigma_{nr}(C) = 0$.

Combining the above results and observing that all the three matrices are PSD, we conclude that $\sigma_1(\nabla^2_{U^*} f(X^*)) \le
C \cdot (M \|X^*\|_2 + \|\nabla f(X^*)\|_2)$. Regarding the lower bound on $\sigma_{nr}(\nabla^2_{U^*} f(X^*))$, we observe the following: due to
the $UU^\top$ factorization and for $U^*$ optimum, we know that also $U^* R$ is optimum, where gradient $\nabla f(U^* (U^*)^\top) =
\nabla f(U^* R R^\top (U^*)^\top) = 0$. This further indicates that the Hessian of $f$ is zero along directions corresponding to
columns of $U^*$, and thus $\sigma_{nr}(\nabla^2_{U^*} f(X^*)) = 0$ along these directions; see Figure 4 (right panel) for an example.
However, for any other directions orthogonal to $U^*$, we have $y^\top (\nabla^2_{U^* R} f(X^*)) y \ge c \cdot m \, \sigma_{\min}(X^*)$, for some
constant $c$. This completes the proof.

□ 
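The gradient formula at the start of the proof, $\nabla_U f(UU^\top) = (\nabla f(UU^\top) + \nabla f(UU^\top)^\top) U$, can be checked against finite differences on a simple separable function. The sketch below assumes the illustrative choice $\varphi_{ij}(x) = \tfrac{1}{2}(x - B_{ij})^2$, so $M = m = 1$. The matrix `B` and all function names are hypothetical and serve only this check.

```python
import numpy as np

def g(U, B):
    """g(U) = f(U U^T) for the separable choice f(X) = 0.5 * sum_ij (X_ij - B_ij)^2."""
    return 0.5 * np.sum((U @ U.T - B) ** 2)

def grad_g(U, B):
    """Closed form (grad f(X) + grad f(X)^T) U, with grad f(X) = X - B."""
    Gf = U @ U.T - B
    return (Gf + Gf.T) @ U

def numerical_grad(U, B, h=1e-6):
    """Central finite differences, perturbing one entry of U at a time."""
    num = np.zeros_like(U)
    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            E = np.zeros_like(U)
            E[i, j] = h
            num[i, j] = (g(U + E, B) - g(U - E, B)) / (2 * h)
    return num

rng = np.random.default_rng(4)
n, r = 5, 2
U = rng.standard_normal((n, r))
B = rng.standard_normal((n, n))   # B need not be symmetric for the formula to hold
max_err = np.max(np.abs(grad_g(U, B) - numerical_grad(U, B)))
```

When $\nabla f$ is symmetric the formula reduces to $2 \nabla f(X) U$. The extra factor $U$ is the source of the dependence on $\tau(X^*)$ discussed in this section.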

To show this dependence in practice, we present some simulation results in Figure 4. We observe that the
convergence rate does indeed depend on $\tau(X^*)$.


F Test case I: Matrix sensing problem 


In this section, we briefly describe and compare algorithms designed specifically for the matrix sensing problem,
using the variable parametrization $X = UU^\top$. To accommodate the PSD constraint, we consider a variation of
the matrix sensing problem where one desires to find $X^*$ that minimizes$^{21}$:

$$
\underset{X \in \mathbb{R}^{n \times n}}{\text{minimize}} \;\; \tfrac{1}{2} \|b - \mathcal{A}(X)\|_2^2 \quad \text{subject to} \quad \mathrm{rank}(X) \le r, \;\; X \succeq 0. \qquad (44)
$$

W.l.o.g., we assume $b = \mathcal{A}(X^*)$ for some rank-$r$ $X^*$. Here, $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^p$ is the linear sensing mechanism,
such that the $i$-th entry of $\mathcal{A}(X)$ is given by $\langle A_i, X \rangle$, for $A_i \in \mathbb{R}^{n \times n}$ sub-Gaussian independent measurement
matrices.

] is one of the first works to propose a provable and efficient algorithm for (44), operating in the $U$-factor
space, while [SB] solves (44) in the stochastic setting; see also [52117T1125] . To guarantee convergence, most of
these algorithms rely on restricted isometry assumptions; see Definition F.1 below.

To compare the above algorithms with Fgd, Subsection F.1 further describes the notion of restricted strong
convexity and its connection with the RIP. Then, Subsection F.2 provides explicit comparison results of the
aforementioned algorithms, with respect to the convergence rate factor $\alpha$, as well as initialization conditions
assumed, for each case.


1 This problem is a special case of affine rank minimization problem m, where no PSD constraints are present. 
Figure 4: Left panel: Assume dimension $n = 50$. We consider the matrix sensing setup [65] and generate
$m = [2 n \log n]$ Gaussian linear measurements of $n \times n$ matrices $X^*$ of rank $r = 2$, with varying condition
number $\tau(X^*)$. We compute the matrix $X = UU^\top$, where $U$ is an $n \times r$ tall matrix, by minimizing the standard least squares
loss function, using our scheme. In the plot, we show the log error versus total number of iterations. Observe
that, varying the condition number of $X^*$, higher $\tau(X^*)$ leads to slower convergence. Right panel: Contour
of function $(u_1^2 + u_2^2 - 1)^2$. Observe the "ring" of points $(u_1, u_2)$ where $f$ is minimized. This illustrates the
existence of multiple points with zero gradient and, thus, directions where the Hessian of the objective is zero.
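As a worked instance of the right panel, take $f(u_1, u_2) = (u_1^2 + u_2^2 - 1)^2$ and the minimizer $u = (1, 0)$ on the ring; the choice of this particular point is only for illustration. Differentiating twice gives

$$
\nabla f(u) = 4 (u_1^2 + u_2^2 - 1) \, u, \qquad \nabla^2 f(u) = 4 (u_1^2 + u_2^2 - 1) \, I + 8 \, u u^\top, \qquad \nabla^2 f(1, 0) = \begin{bmatrix} 8 & 0 \\ 0 & 0 \end{bmatrix}.
$$

The zero eigenvalue belongs to the direction $(0, 1)$, which is tangent to the ring at $(1, 0)$: moving along the ring, i.e., rotating the minimizer, does not change $f$ to second order, while the radial direction $(1, 0)$ has curvature $8$.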


F.l Restricted isometry property and restricted strong convexity 

To shed some light on the notion of restricted strong convexity and how it relates to the RIP, consider the
matrix sensing problem, as described above. According to (44), we consider the quadratic loss function:

$$
f(X) = \tfrac{1}{2} \|b - \mathcal{A}(X)\|_2^2.
$$

Since the Hessian of $f$ is given by $\mathcal{A}^* \mathcal{A}$, restricted strong convexity suggests that [59]:

$$
\|\mathcal{A}(Z)\|_2^2 \ge C \cdot \|Z\|_F^2, \quad Z \in \mathbb{R}^{n \times n},
$$

for a restricted set of directions $Z$, where $C > 0$ is a small constant. This bound implies that the quadratic loss
function, as defined above, is strongly convex in such a restricted set of directions $Z$.$^{22}$

A similar but stricter notion is that of restricted isometry property for low rank matrices [20, 54]:

Definition F.1 (Restricted Isometry Property (RIP)). A linear map $\mathcal{A}$ satisfies the $r$-RIP with constant $\delta_r$, if

$$
(1 - \delta_r) \|X\|_F^2 \le \|\mathcal{A}(X)\|_2^2 \le (1 + \delta_r) \|X\|_F^2,
$$

is satisfied for all matrices $X \in \mathbb{R}^{n \times n}$ such that $\mathrm{rank}(X) \le r$.


The correspondence of restricted strong convexity with the RIP is obvious: both lower bound the quantity
$\|\mathcal{A}(X)\|_2^2$, where $X$ is drawn from a restricted set. It turns out that linear maps that satisfy the RIP for low
rank matrices, also satisfy the restricted strong convexity; see Theorem 2 in [21].

By assuming RIP in (44), the condition number of $f$ depends on the RIP constants of the linear map $\mathcal{A}$; in
particular, one can show that $\kappa = \frac{M}{m} \propto \frac{1 + \delta}{1 - \delta}$, since the eigenvalues of $\mathcal{A}^* \mathcal{A}$ lie between $1 - \delta$ and $1 + \delta$, when
restricted to low-rank matrices. For $\delta$ sufficiently small and dimension $n$ sufficiently large, $\kappa \approx 1$, which, with
high probability, is the case for $\mathcal{A}$ drawn from a sub-Gaussian distribution.

F.2 Comparison 

Given the above discussion, the following hold true for Fgd, under RIP settings: 

22 One can similarly define the notion of restricted smoothness condition, where $\|\mathcal{A}(Z)\|_2^2$ is upper bounded by a constant multiple of $\|Z\|_F^2$.
(i) In the noiseless case, $b = \mathcal{A}(X^*)$ and thus, $\|\nabla f(X^*)\|_2 = \|-2 \mathcal{A}^*(b - \mathcal{A}(X^*))\|_2 = 0$. Combined with the
above discussion, this leads to convergence rate factor


in Fgd. 

(ii) In the noisy case, $b = \mathcal{A}(X^*) + e$ where $e$ is an additive noise term; for this case, we further assume that
$\|\mathcal{A}^*(e)\|_2$ is bounded. Then,


< 1 - 


t(U*) 2 A _ l| - A(e)l| 2 _ 

T \ U r-> + ) 


Table 2 summarizes convergence rate factors $\alpha$ and initialization conditions of state-of-the-art approaches
for the noiseless case. 


Reference 

Dist (U + , U* 

) 2 < 

a - Dist (U,U*) 2 

Dist 

(u°,u*) < ■•■ 

m 


a = 

1 t 

16 

V&8 


m 

a = 

■- 1 - 

ci 

XWF 


\°r{U* r ) 

m 

a = 

i - 

c 2 

r(U*)-r 


^T& ' <J r(U r ) 

m 

a = 

i - 

c 3 

r(Uf) 10 

(1 

- t) ■ a r (U*) 

This work 

a = 

= 1 - 

c 4 

<Uf ) 2 

1 

100 ' 



Table 2: Comparison of related work for the matrix sensing problem. All methods use UU T parametrization of 
the variable A and admit linear convergence, r = y/128 according to [23]. Ci > 0, Vz denote absolute constants. 
In [32], the proposed algorithm is designed to solve the rectangular case where $X = UV^\top$; the reported factor $\alpha$
and initial conditions could be improved for the case of (2). † Note that this convergence is in terms of subspace
distance.


F.3 Empirical results 

We start our discussion on empirical findings with respect to the convergence rate of the algorithm, how the 
step size and initialization affects its efficiency and some comparison plots with an efficient first-order projected 
gradient solver. We note that the experiments presented below are performed as a proof of concept and are not 
complete in the set of algorithms we could compare with. 

Linear convergence rate and step size selection: To show the convergence rate of the factored gradient
descent in practice, we solve affine rank minimization problem instances with synthetic data. In particular, the
ground truth $X^* \in \mathbb{R}^{n \times n}$ is synthesized as a rank-$r$ matrix as $X^* = U^* (U^*)^\top$, where $U^* \in \mathbb{R}^{n \times r}$. In sequence,
we sub-sample $X^*$ by observing $m = C_{\mathrm{sam}} \cdot p \cdot r$ entries, according to:

$$
y = \mathcal{A}(X^*) \in \mathbb{R}^m. \qquad (45)
$$
We use permuted and sub-sampled noiselets for the linear operator A : R nx " — > R m ; for more information, see 
m- ye R m contains the linear measurements of A* through A in vectorized form. We consider the noiseless 
case, for ease of exposition. Under this setting, we solve (2) with $f(UU^\top) := \tfrac{1}{2} \cdot \|y - \mathcal{A}(UU^\top)\|_2^2$. We use as
a stopping criterion the condition $\|U^+ (U^+)^\top - UU^\top\|_F \le \mathrm{tol} \cdot \|U^+ (U^+)^\top\|_F$ where $\mathrm{tol} := 5 \cdot 10^{-6}$.
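The update $U^+ = U - \eta \nabla f(X) U$ together with the relative-change stopping rule can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the experimental code: the function name `fgd` is hypothetical, the step size is computed once at the starting point in the form $1 / (16 (M \|X^0\|_2 + \|\nabla f(X^0)\|_2))$ modeled on the surrogate $\hat{\eta}$ of Appendix B, and the toy objective replaces the noiselet operator by full observation, $f(X) = \tfrac{1}{2}\|X - X^*\|_F^2$ with $M = 1$.

```python
import numpy as np

def fgd(grad_f, U0, M, tol=5e-6, max_iter=5000):
    """Factored gradient descent on g(U) = f(U U^T) with a fixed step size.

    grad_f: callable returning the (symmetric) gradient of f at X.
    M: smoothness constant of f, assumed known in this sketch.
    """
    U = U0.copy()
    X = U @ U.T
    # Step size computed once at the initial point (an assumption of this sketch).
    eta = 1.0 / (16.0 * (M * np.linalg.norm(X, 2) + np.linalg.norm(grad_f(X), 2)))
    for it in range(max_iter):
        U_next = U - eta * grad_f(X) @ U
        X_next = U_next @ U_next.T
        # Relative-change stopping rule from the text.
        done = np.linalg.norm(X_next - X, "fro") <= tol * np.linalg.norm(X_next, "fro")
        U, X = U_next, X_next
        if done:
            break
    return U, it + 1

# Toy instance: fully observed case, f(X) = 0.5 * ||X - X_star||_F^2, so M = 1.
rng = np.random.default_rng(6)
n, r = 20, 2
U_star, _ = np.linalg.qr(rng.standard_normal((n, r)))   # orthonormal columns
X_star = U_star @ U_star.T
U0 = U_star + 0.01 * rng.standard_normal((n, r))         # start near the optimum
U_hat, iters = fgd(lambda X: X - X_star, U0, M=1.0)
rel_err = np.linalg.norm(U_hat @ U_hat.T - X_star, "fro") / np.linalg.norm(X_star, "fro")
```

No projection or SVD is performed inside the loop; each iteration costs one gradient evaluation and two matrix products, which is the per-iteration advantage discussed in the experiments that follow.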

Figure 5 shows the linear convergence of our approach as well as the efficiency of our step selection, as
compared to other arbitrary constant step size selections. All instances use our initialization point. It is worth
mentioning that the performance of our step size can be inferior to specific constant step size selections; however,
finding such a good constant step size usually requires trial-and-error rounds and does not come with convergence
guarantees. Moreover, we note that one can perform line search procedures to find the "best" step size per
iteration; however, for more complicated $f$ instances, such step size selection might not be computationally
desirable, or might even be infeasible.

Impact of avoiding low-rank projections on the PSD cone: In this experiment, we compare factored 
gradient descent with a variant of the Singular Value Projection (SVP) algorithm 41] [8p| For the purpose of 
this experiment, the SVP variant further projects on the PSD cone, along with the low rank projection. Its main 
difference is that it does not operate on the factor U space but requires projection over the (low-rank) positive 
semi-definite cone per iteration. In the discussion below, we refer to this variant as SVP (SDP). 
Figure 5: Median error per iteration of factored gradient descent algorithm for different step sizes, over 20
Monte Carlo iterations. The number of measurements is fixed to $C_{\mathrm{sam}} \cdot n \cdot r$ for varying $C_{\mathrm{sam}} \in \{4, 6, 10\}$. Here,
$n = 1204$ and rank $r = 5$. Curves show convergence behavior of factored gradient descent as a function of the
step size selection. One can observe that arbitrary step size selections can lead to slow convergence. Moreover,
good constant step size selections for a specific problem configuration do not necessarily translate into good
performance for a different setting; e.g., observe how the constant step size convergence rates worsen faster, as
we decrease the number of observations.



[Figure 6 plots. Panels: Rank r = 5 (left) and Rank r = 10 (right); horizontal axis: number of iterations. Legend: Our scheme, p = n = 512, time 7.9902 sec; Our scheme, p = n = 1024, time 29.4016 sec; Our scheme, p = n = 2048, time 115.6227 sec; SVP (SDP), p = n = 512, time 26.1453 sec; SVP (SDP), p = n = 1024, time 162.3705 sec; SVP (SDP), p = n = 2048, time 1113.4051 sec.]


Figure 6: Median error per iteration for factored gradient descent and SVP (SDP) algorithms, over 20 Monte
Carlo iterations. For all cases, the number of measurements is fixed to $C_{sam} \cdot n \cdot r$ for $C_{sam} = 6$. From left to right,
we consider different rank configurations: (i) r = 5 and (ii) r = 10. Both schemes use the same initialization
point. Both plots show better convergence rate performance in terms of iterations due to our step size selection.
In addition, factored gradient descent avoids performing SVD operations per iteration, which also leads to
lower per-iteration complexity; see also Table 3.


We perform two experiments. In the first experiment, we compare factored gradient descent with SVP (SDP),
as designed in [41]: i.e., while we use our initialization point for both schemes, the step size selections are different.
Figure 6 shows some convergence rate results: our step size selection performs better in practice, in
terms of the total number of iterations required for convergence.

In the second experiment, we highlight the time bottleneck introduced by the projection
operations: to this end, we use the same initialization points and step sizes for both algorithms under
comparison. Thus, the only difference lies in the SVD computations that SVP (SDP) needs to retain a PSD low-rank
estimate per iteration. Table 3 presents reconstruction error and execution time results. Projecting
onto the low-rank PSD cone per iteration constitutes a computational bottleneck, which
slows down (w.r.t. total time required) the convergence of SVP (SDP).
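
To make the per-iteration cost of the projection step concrete, the following is a minimal sketch of a Frobenius-norm projection onto the set of PSD matrices of rank at most r. The function name `project_psd_rank_r` and the use of a dense `numpy.linalg.eigh` call are illustrative assumptions; the experiments reported here rely on an SVD routine whose implementation details are not reproduced.

```python
import numpy as np

def project_psd_rank_r(X, r):
    """Nearest (in Frobenius norm) PSD matrix of rank <= r to a symmetric X."""
    X_sym = (X + X.T) / 2.0              # guard against round-off asymmetry
    w, V = np.linalg.eigh(X_sym)         # full eigendecomposition, O(n^3) work
    top = np.argsort(w)[::-1][:r]        # indices of the r largest eigenvalues
    w_top = np.clip(w[top], 0.0, None)   # negative eigenvalues are set to zero
    return (V[:, top] * w_top) @ V[:, top].T

rng = np.random.default_rng(0)
S = rng.standard_normal((8, 8))
Z = project_psd_rank_r(S + S.T, r=3)
```

A factored update only multiplies an n x n gradient by an n x r factor, so it never needs this decomposition; that is the difference the timing comparison isolates.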

Initialization. Here, we evaluate the importance of our initialization point selection:

$$X^0 := \mathcal{P}_+\left(\frac{-\nabla f(0)}{\|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F}\right). \qquad (46)$$

23 SVP is a non-convex, first-order, projected gradient descent scheme for low rank recovery from linear measurements.

Error denotes $\|\widehat{X} - X^\star\|_F / \|X^\star\|_F$.

  Model           Error                          Time (sec)
  n      r        SVP (SDP)     Our scheme       SVP (SDP)     Our scheme
  -------------------------------------------------------------------------
  512    5        1.1339e-03    8.4793e-04       36.9652       11.4757
  512    10       4.6552e-04    4.4954e-04       19.6089       7.9902
  512    20       1.6541e-04    2.0571e-04       10.6052       6.4149
  1024   5        2.4224e-03    9.9180e-04       225.6230      43.0964
  1024   10       1.0203e-03    4.5103e-04       121.7779      29.4016
  1024   20       4.1149e-04    2.3442e-04       67.6272       22.9616
  2048   5        4.8500e-03    1.0093e-03       1512.1969     173.5237
  2048   10       2.0836e-03    4.6735e-04       835.0538      115.6227
  2048   20       9.4893e-04    2.6417e-04       458.8766      88.1960

Table 3: Summary of comparison results for reconstruction and efficiency. Observe that both our scheme and
SVP (SDP) require more iterations to converge as r radically decreases. This justifies the higher time-complexity
observed; see also Figure 6 for comparison.


To do so, we consider the following settings: we compare random initializations against the rule (46), both for 
constant step size selections and our step size selection. In all cases, we work with the factored parametrization. 
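
As an illustration of how an initialization of this kind can be computed with only gradient (first-order oracle) access, here is a minimal sketch. It assumes that rule (46) reads $X^0 = \mathcal{P}_+(-\nabla f(0)) / \|\nabla f(0) - \nabla f(e_1 e_1^\top)\|_F$, with $\mathcal{P}_+$ the projection onto the PSD cone, and that the factor $U^0$ is taken from the top-r eigenpairs of $X^0$. The helper names `psd_projection` and `initialize_factor`, and the least-squares test function, are hypothetical.

```python
import numpy as np

def psd_projection(S):
    """Projection of a symmetric matrix onto the PSD cone."""
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T

def initialize_factor(grad_f, n, r):
    """Build U0 (n x r) from two gradient evaluations (assumed form of rule (46))."""
    G0 = grad_f(np.zeros((n, n)))
    E11 = np.zeros((n, n))
    E11[0, 0] = 1.0                                   # e_1 e_1^T
    scale = np.linalg.norm(G0 - grad_f(E11), 'fro')
    X0 = psd_projection(-G0) / scale
    w, V = np.linalg.eigh(X0)
    top = np.argsort(w)[::-1][:r]                     # top-r eigenpairs of X0
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))

# Example with f(X) = 0.5 * ||X - X_star||_F^2, whose gradient is X - X_star.
rng = np.random.default_rng(0)
n, r = 10, 2
B = rng.standard_normal((n, r))
X_star = B @ B.T
U0 = initialize_factor(lambda X: X - X_star, n, r)
```

For this particular least-squares function the normalization constant equals 1 and the initialization already recovers X_star, which is why it is only useful as a sanity check and not as a benchmark.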

Figure 7 shows the results. The left panel presents results for constant step size selections where $\eta = 0.1/\|U\|_F^2$,
and the right panel uses our step size selection; again, note that the constant step size was selected after
many trial-and-error rounds, based on the specific configuration. Both panels compare
the performance of factored gradient descent when (i) a random initialization point is selected and (ii) our
initialization is performed, according to (46). All curves depict median reconstruction errors over 20 Monte
Carlo iterations. For all cases, the number of measurements is fixed to $C_{sam} \cdot n \cdot r$ for $C_{sam} = 10$, n = 1024 and
rank r = 20.




Figure 7: Median error per iteration for different initialization setups. The left panel presents results for constant
step size selections where $\eta = 0.1/\|U\|_F^2$, and the right panel uses our step size selection. Both panels compare
the performance of factored gradient descent when (i) a random initialization point is selected and (ii) our
initialization is performed, according to (46). All curves depict median reconstruction errors over 20 Monte
Carlo iterations. For all cases, the number of measurements is fixed to $C_{sam} \cdot n \cdot r$ for $C_{sam} = 10$, n = 1024 and
rank r = 20.


Dependence of $\alpha$ on $\sigma_1(X^\star)/\sigma_r(X^\star)$. Here, we highlight how the convergence rate of factored
gradient descent depends on $\sigma_1(X^\star)/\sigma_r(X^\star)$. Consider the following matrix sensing toy example: let $X^\star := U^\star (U^\star)^\top \in \mathbb{R}^{n \times n}$ for n = 50
and assume $\text{rank}(X^\star) \geq r$. We desire to compute an (at most) rank-r approximation of $X^\star$ by minimizing the
simple least squares loss function:

$$\underset{X \in \mathbb{R}^{n \times n}}{\text{minimize}} \; \tfrac{1}{2}\|X - X^\star\|_F^2 \quad \text{subject to} \quad X \succeq 0, \; \text{rank}(X) \leq r. \qquad (47)$$

For this example, let us consider r = 3 and design $X^\star$ according to the following three scenarios: we fix
$\sigma_1(X^\star) = \sigma_2(X^\star) = 100$ and vary $\sigma_3(X^\star) \in \{1, 10, 20\}$. This leads to the following condition numbers for these three cases:
(i) $\sigma_1(X^\star)/\sigma_3(X^\star) = 100$, (ii) $\sigma_1(X^\star)/\sigma_3(X^\star) = 10$ and (iii) $\sigma_1(X^\star)/\sigma_3(X^\star) = 5$. The convergence behavior is shown in Figure 8
(left panel). Factored gradient descent suffers, w.r.t. convergence rate, as the condition
number gets worse; in particular, for the case where $\sigma_1(X^\star)/\sigma_3(X^\star) = 100$, factored gradient descent reaches a
plateau after the ~80-th iteration, where the steps towards the solution become smaller. As the condition number
improves, factored gradient descent enjoys faster convergence to the optimum, which shows the dependence of
the algorithm on $\sigma_1(X^\star)/\sigma_r(X^\star)$, also in practice.

As a second setting, we fix r = 2, thereby computing a rank-2 approximation. As Figure 8 (right panel)
illustrates, for all values of $\sigma_3(X^\star)$, factored gradient descent performs similarly, enjoying fast convergence
towards the optimum. Thus, while the condition number of the original $X^\star$ varies to a large degree for r = 3,
the convergence rate factor $\alpha$ only depends on $\sigma_1(X^\star)/\sigma_2(X^\star) = 1$ for r = 2. This leads to similar convergence behavior
for all three scenarios described above.
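
The toy setting can be reproduced in a few lines. The sketch below is not the exact script behind the reported curves: the random starting point, the seed, the iteration budget, and the per-iteration step size $1/(16(M\|X\|_2 + \|\nabla f(X) Q_U Q_U^\top\|_2))$ with M = 1 are assumptions made for illustration, and `fgd_toy` is a hypothetical helper name.

```python
import numpy as np

def fgd_toy(sigma3, r=3, n=50, iters=500, seed=0):
    """Factored gradient descent on 0.5 * ||U U^T - X_star||_F^2 (toy problem)."""
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((n, 3)))       # orthonormal directions
    X_star = (V * np.array([100.0, 100.0, sigma3])) @ V.T  # spectrum (100, 100, sigma3)
    U = 0.1 * rng.standard_normal((n, r))                  # random start (assumption)
    errors = []
    for _ in range(iters):
        X = U @ U.T
        grad = X - X_star                                  # gradient of the loss at X
        Q, _ = np.linalg.qr(U)                             # orthonormal basis of span(U)
        # ||grad Q Q^T||_2 equals ||grad Q||_2 because Q has orthonormal columns.
        eta = 1.0 / (16.0 * (np.linalg.norm(X, 2) + np.linalg.norm(grad @ Q, 2)))
        U = U - eta * grad @ U
        errors.append(np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))
    return np.array(errors)

err_bad = fgd_toy(sigma3=1.0)     # condition number 100
err_good = fgd_toy(sigma3=20.0)   # condition number 5
```

Plotting `err_bad` and `err_good` on a logarithmic axis gives a qualitative view of how the conditioning of $X^\star$ affects progress; the exact curves depend on the assumed starting point and step size.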




Figure 8: Toy example on the dependence of $\alpha$ on the term $\sigma_1(X^\star)/\sigma_r(X^\star)$. Here, $X^\star := U^\star (U^\star)^\top \in \mathbb{R}^{n \times n}$ for
n = 50. We use factored gradient descent to solve (47) for r = 3. Left panel: As the condition number
improves, factored gradient descent enjoys faster convergence in practice, as dictated by our theory. Right
panel: convergence rate behavior of factored gradient descent when r = 2 in (47).


G Test case II: Quantum State Tomography 


As a second example, we consider the quantum state tomography (QST) problem. QST can be described as
follows:

$$\underset{X \succeq 0}{\text{minimize}} \; \|\mathcal{A}(X) - y\|_F^2 \quad \text{subject to} \quad \text{rank}(X) = r, \; \text{Tr}(X) \leq 1. \qquad (48)$$


In this problem, we look for an $n \times n$ density matrix $X^\star$ of a q-bit quantum system from a set of QST
measurements $y \in \mathbb{R}^m$, $m \ll n^2$, that satisfy $y = \mathcal{A}(X^\star) + \eta$. Here, $(\mathcal{A}(X^\star))_i = \text{Tr}(E_i X^\star)$ and $\eta_i$ could be
modeled as zero-mean Gaussian. The operators $E_i \in \mathbb{R}^{n \times n}$ are typically the tensor product of the 2 x 2 Pauli
matrices [54]. The density matrix is a priori known to be a Hermitian, positive semi-definite matrix that satisfies
$\text{rank}(X^\star) = r$ and is normalized as $\text{Tr}(X^\star) = 1$ [33]; here, $n = 2^q$. Our task is to recover $X^\star$.


One can easily transform (48) into the following re-parameterized formulation:

$$\underset{U \in \mathbb{R}^{n \times r}}{\text{minimize}} \; \|\mathcal{A}(UU^\top) - y\|_F^2 \quad \text{subject to} \quad \|U\|_F^2 \leq 1. \qquad (49)$$

[Figure 9 plots. Right panel title: Rank r = 20 - Samples m = 3 · n · r.]

Figure 9: Left and middle panels: Convergence performance of algorithms under comparison w.r.t. $\|\widehat{X} - X^\star\|_F / \|X^\star\|_F$
vs. (i) the total number of iterations (left) and (ii) the total execution time. Both cases correspond to the case
$C_{sam} = 3$, r = 1 (pure state setting) and q = 12 (i.e., n = 4096). Right panel: Almost pure state (r = 20).
Here, $C_{sam} = 3$.


This problem formulation is not among the cases Fgd naturally solves, due to the Frobenius
norm constraint on the factor $U$. However, as a heuristic, one can alter Fgd to include such a projection: per
iteration, each putative solution $U^+$ can be trivially projected onto the Frobenius norm ball $\|U\|_F^2 \leq 1$; let
us call this heuristic projFGD. We compare this algorithm with state-of-the-art schemes for QST to show the
merits of our approach. The analysis of such constrained cases is an interesting and important extension of this
paper and is left for future work.
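
The projection used by the projFGD heuristic amounts to a rescaling of the factor. The following minimal sketch assumes a real-valued factor $U$ and a user-supplied gradient oracle; the names `project_frobenius_ball` and `proj_fgd_step` are hypothetical, and the step size `eta` is left as an input since its selection for the constrained case is not analyzed here.

```python
import numpy as np

def project_frobenius_ball(U, radius_sq=1.0):
    """Project U onto {U : ||U||_F^2 <= radius_sq} by rescaling."""
    norm_sq = float(np.sum(U * U))
    if norm_sq <= radius_sq:
        return U                                  # already feasible
    return U * np.sqrt(radius_sq / norm_sq)

def proj_fgd_step(U, grad_f, eta):
    """One heuristic step: factored gradient update, then projection."""
    X = U @ U.T
    U_plus = U - eta * grad_f(X) @ U
    return project_frobenius_ball(U_plus)

rng = np.random.default_rng(0)
U_big = 5.0 * rng.standard_normal((6, 2))
U_proj = project_frobenius_ball(U_big)
```

Since $\|U\|_F^2 = \text{Tr}(UU^\top)$, keeping the factor inside the unit Frobenius ball enforces the trace constraint $\text{Tr}(X) \leq 1$ on $X = UU^\top$.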


State-of-the-art approaches. One of the first provable algorithmic solutions for the QST problem was 
through convexification j29 ( l65 [ I2T]: this includes nuclear norm minimization approaches [331 (using e.g., [7j), 
as well as proximal variants [315! (using e.g., [17]). Recently, [79] presented a universal primal-dual convex 
framework, which includes the QST problem as application and outperforms the above approaches, both in 
terms of recovery performance and execution time. 

From a non-convex perspective, apart from Hazan's algorithm [34] (see Related work), [5] propose Randomized
Singular Value Projection (RSVP), a projected gradient descent algorithm for (48), which merges gradient
calculations with truncated SVDs via randomized approximations for computational efficiency.


Experiments. Figure 9 (two leftmost plots) illustrates the iteration and timing complexities of each algorithm
under comparison, for a pure state density recovery setting (r = 1). Here, q = 12, which corresponds to an
$n(n+1)/2 = 8,390,656$ dimensional problem; moreover, we assume $C_{sam} = 3$ and thus the number of measurements
is m = 12,288. For initialization, we use the proposed initialization for all algorithms.
Fgd converges faster to a vicinity of $X^\star$, as compared to the rest of the algorithms; observe also the sublinear
rate of SparseApproxSDP in the inner plots, as reported in [34].
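
The problem sizes quoted in this setting follow from q, r and the sampling constant. The short check below is only an arithmetic illustration; it uses the fact that 8,390,656 equals n(n+1)/2, the number of free entries of a symmetric n x n matrix, and that the number of measurements is the sampling constant times n times r. The variable names are arbitrary.

```python
q = 12                          # number of qubits
n = 2 ** q                      # matrix dimension, n = 2^q
r, c_sam = 1, 3                 # pure state setting and sampling constant

dim = n * (n + 1) // 2          # free entries of a symmetric n x n matrix
m = c_sam * n * r               # number of measurements

print(n, dim, m)                # 4096 8390656 12288
```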


H Test case III: PSD problems with high-rank solutions 


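As a final example, we consider problems of the form:

$$\underset{X \in \mathbb{R}^{n \times n}}{\text{minimize}} \; f(X) \quad \text{subject to} \quad X \succeq 0,$$

where $X^\star$ is the minimizer of the above problem and $\text{rank}(X^\star) = O(n)$. In this particular case, and assuming
we are interested in finding a high-rank $X^\star$, we can reparameterize the above problem as follows:

$$\underset{U \in \mathbb{R}^{n \times O(n)}}{\text{minimize}} \; f(UU^\top).$$

Observe that $U$ is a square $n \times O(n)$ matrix. Under this setting, Fgd performs the recursion:

$$\underbrace{U^+}_{n \times O(n)} = \underbrace{U}_{n \times O(n)} - \eta \, \underbrace{\nabla f(UU^\top)}_{n \times n} \cdot \underbrace{U}_{n \times O(n)}.$$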


Due to the matrix-matrix multiplication, the per-iteration time complexity of Fgd is $O(n^3)$, which is comparable
to an SVD calculation of an $n \times n$ matrix. In this experiment, we study the performance of Fgd in such high-rank
cases and compare it with state-of-the-art approaches for PSD constrained problems.

For the purpose of this experiment, we only consider first-order solvers; i.e., second-order methods such as
interior point methods are excluded since, in high dimensions, it is prohibitively expensive to compute the Hessian of $f$. To
this end, the algorithms under comparison include: (i) the standard projected gradient descent approach [35] and (ii)
Frank-Wolfe type algorithms, such as the one in [34]. We note that this experiment can be seen as a proof
of concept of how avoiding SVD calculations helps in practice.

Experiments. We consider the simple example of matrix sensing [49]: we obtain a set of measurements
$y \in \mathbb{R}^m$ according to the linear model:

$$y = \mathcal{A}(X^\star).$$

Here, $\mathcal{A} : \mathbb{R}^{n \times n} \rightarrow \mathbb{R}^m$ is a sensing mechanism such that $(\mathcal{A}(X))_i = \text{Tr}(A_i X)$ for some Gaussian random
matrices $A_i$, $i = 1, \dots, m$. The ground truth matrix $X^\star$ is designed such that $\text{rank}(X^\star) = r = O(n)$ and $\text{Tr}(X^\star) = 1$ (see footnote 25).
Figure 10 and Table 4 show some results for the following settings: (i) n = 1024, r = n/4 and m = 2nr,
(ii) n = 2048, r = n/4 and m = 2nr, (iii) n = 2048, r = n/8 and m = 4nr. From our findings, we observe
that, even for high-rank cases where $r = O(n)$, performing matrix factorization and optimizing over the
factors results in much faster convergence, as compared to low-rank projection algorithms, such as RSVP in
[8]. Furthermore, Fgd performs better than SparseApproxSDP [34] in practice: while SparseApproxSDP is a
Frank-Wolfe type algorithm (and thus, the per-iteration complexity is low), it admits sublinear convergence,
which leads to suboptimal performance in terms of total execution time. However, the RSVP and SparseApproxSDP
algorithms do not assume specific initialization procedures to work in theory.
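
The sensing model above can be sketched as follows. This is a small-scale illustration only: the objective is assumed to be $f(X) = \tfrac{1}{2}\|\mathcal{A}(X) - y\|_2^2$, the Gaussian matrices are symmetrized so that the gradient is symmetric, and the helper names (`make_sensing_matrices`, `sense`, `sense_adjoint`, `grad_f`) are hypothetical.

```python
import numpy as np

def make_sensing_matrices(m, n, rng):
    """Stack of m Gaussian matrices, symmetrized (an assumption of this sketch)."""
    A = rng.standard_normal((m, n, n))
    return (A + A.transpose(0, 2, 1)) / 2.0

def sense(A, X):
    """(A(X))_i = <A_i, X>, which equals Tr(A_i X) for symmetric A_i."""
    return np.einsum('kij,ij->k', A, X)

def sense_adjoint(A, z):
    """Adjoint operator: sum_i z_i * A_i."""
    return np.einsum('k,kij->ij', z, A)

def grad_f(A, y, X):
    """Gradient of f(X) = 0.5 * ||A(X) - y||_2^2."""
    return sense_adjoint(A, sense(A, X) - y)

rng = np.random.default_rng(0)
n, r = 16, 4
m = 2 * n * r                          # mirrors the m = 2nr setting at toy scale
U_star = rng.standard_normal((n, r))
X_star = U_star @ U_star.T
X_star /= np.trace(X_star)             # normalize so that Tr(X_star) = 1
A = make_sensing_matrices(m, n, rng)
y = sense(A, X_star)
```

Storing all sensing matrices as a dense (m, n, n) array is only viable at toy scale; for the problem sizes in the experiments a different representation would be needed.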


I Convergence without tail bound assumptions 

In this section, we show how assumptions (A3) and (A4) can be dropped by using a different step size $\widehat{\eta}$, where
a spectral norm calculation of two $n \times r$ matrices is required per iteration. Here, we succinctly describe the main
theorems and how they differ from the case where $\eta$ as in (8) is used. We also focus only on the case of restricted
strongly convex functions. A similar extension is possible without restricted strong convexity.

Our discussion is organized as follows: we first re-define key lemmas (e.g., descent lemmas, etc.) for a
different step size; then, we state the main theorems and a sketch of their proof. In the analysis below we use
as step size:

$$\widehat{\eta} = \frac{1}{16\left(M\|X\|_2 + \|\nabla f(X) Q_U Q_U^\top\|_2\right)}.$$


I.1 Key lemmas

Next, we present the main descent lemma that is used for both sublinear and linear convergence rate guarantees 
of Fgd. 

24 Here, we assume a standard Lanczos implementation of SVD, as the one provided in the Matlab environment.

25 The reason we design $X^\star$ such that $\text{Tr}(X^\star) = 1$ is so that the algorithm SparseApproxSDP [34] applies; this is due to the fact
that SparseApproxSDP is designed for QST problems, where a trace constraint is present in the optimization criterion; see also (48)
without the rank constraint.


Error denotes $\|\widehat{X} - X^\star\|_F / \|X^\star\|_F$.

  Algorithm          Error         Total time (sec)   Time per iter. (sec, median)
  ---------------------------------------------------------------------------------
  Setting: n = 1024, r = n/4, m = 2nr.
  RSVP               4.9579e-04    1262.3719          2.1644
  SparseApproxSDP    3.3329e-01    895.9605           2.1380e-01
  FGD                1.6763e-04    57.8495            2.1961e-01

  Setting: n = 2048, r = n/4, m = 2nr.
  RSVP               4.9537e-04    8412.6981          14.6811
  SparseApproxSDP    3.3526e-01    26962.0379         8.7761e-01
  FGD                1.6673e-04    272.8102           1.0040e+00

  Setting: n = 2048, r = n/8, m = 4nr.
  RSVP               2.4254e-04    1945.6714          5.9763
  SparseApproxSDP    9.6725e-02    3506.8147          8.6440e-01
  FGD                3.8917e-05    68.5689            9.2567e-01

Table 4: Comparison of related work in high-rank matrix sensing problems. We construct $X^\star$ with $\text{Tr}(X^\star) = 1$
such that [34] applies. Avoiding SVDs helps in practice.


[Figure 10 plots. Columns, left to right: n = 1024 - r = n/4 - m = 2 · n · r; n = 2048 - r = n/4 - m = 2 · n · r;
n = 2048 - r = n/8 - m = 4 · n · r. Top row horizontal axis: number of iterations. Bottom row horizontal axis:
cumulative time (sec). Legend: RSVP, FGD.]

Figure 10: Convergence performance of algorithms under comparison w.r.t. $\|\widehat{X} - X^\star\|_F / \|X^\star\|_F$ vs. (i) the total number
of iterations (top) and (ii) the total execution time (bottom).


Lemma I.1 (Descent lemma). For $f$ being an M-smooth and (m, r)-strongly convex function and under assumptions
(A2) and $f(X^+) \geq f(X_r^\star)$, the following inequality holds true:

k(U-U + ,U- UfBfj) > lv\\Xf(X)U\\ 2 F + ^ • a r (X*)\\Af F . 

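Proof of Lemma I.1. By (23), we have:

$$\langle \nabla f(X) U, \; U - U_r^\star R_U^\star \rangle = \tfrac{1}{2} \langle \nabla f(X), X - X_r^\star \rangle + \tfrac{1}{2} \langle \nabla f(X), \Delta\Delta^\top \rangle. \qquad (50)$$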
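
The decomposition in (50) is a purely algebraic identity: writing $\Delta = U - U_r^\star R_U^\star$, one has $X - X_r^\star = U\Delta^\top + \Delta U^\top - \Delta\Delta^\top$, and the claim follows for any symmetric matrix in place of the gradient. The snippet below is a numerical sanity check under these assumptions, with random matrices standing in for $\nabla f(X)$ and for $U_r^\star R_U^\star$; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 6, 2
U = rng.standard_normal((n, r))
U_ref = rng.standard_normal((n, r))     # stands in for U_r^* R_U^*
G = rng.standard_normal((n, n))
G = (G + G.T) / 2.0                     # symmetric stand-in for grad f(X)

Delta = U - U_ref
X = U @ U.T
X_ref = U_ref @ U_ref.T

def inner(A, B):
    """Trace inner product <A, B>."""
    return float(np.sum(A * B))

lhs = inner(G @ U, Delta)
rhs = 0.5 * inner(G, X - X_ref) + 0.5 * inner(G, Delta @ Delta.T)
```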


Step I: Bounding $\langle \nabla f(X), X - X_r^\star \rangle$. For this term, we have a variant of Lemma 6.2 as follows:

Lemma I.2. Let $f$ be an M-smooth and (m, r)-restricted strongly convex function with optimum point $X^\star$.
Assume $f(X^+) \geq f(X_r^\star)$. Let $X = UU^\top$. Then,

(Xf(x),x - Xf) > M||v/ ( x)f/|| 2 F + f \\x - x;f F , 

where rf = ' 


The proof of this lemma is provided in Section J.1.

Step II: Bounding $\langle \nabla f(X), \Delta\Delta^\top \rangle$. For the second term, we have the following variant of Lemma 6.3.

Lemma I.3. Let $f$ be M-smooth and (m, r)-restricted strongly convex. Then, under assumptions (A2) and
$f(X^+) \geq f(X_r^\star)$, the following bound holds true:


<V/(A), AA t ) > -f \\Xf(X)U\\ 2 F - . ||A|| 

The proof of this lemma can be found in Section J.2.


2 

F- 


Step III: Combining the bounds in equation (50). The rest of the proof is similar to that of Lemma 6.1 


□ 


I.2 Proof of linear convergence

For the case of (restricted) strongly convex functions $f$, we have the following revised theorem:

Theorem I.4 (Convergence rate for restricted strongly convex $f$). Let the current iterate be $U$ and $X = UU^\top$. Assume
$\text{Dist}(U, U_r^\star) \leq \rho' \sigma_r(U_r^\star)$ and let the step size be $\widehat{\eta} = \frac{1}{16\left(M\|X\|_2 + \|\nabla f(X) Q_U Q_U^\top\|_2\right)}$. Then, under assumptions
(A2) and $f(X^+) \geq f(X_r^\star)$, the new estimate $U^+ = U - \widehat{\eta}\, \nabla f(X) \cdot U$ satisfies

$$\text{Dist}(U^+, U_r^\star)^2 \leq \alpha \cdot \text{Dist}(U, U_r^\star)^2, \qquad (51)$$

where $\alpha = 1 - \frac{m \sigma_r(X^\star)}{64\left(M\|X^\star\|_2 + \|\nabla f(X^\star)\|_2\right)}$. Furthermore, $U^+$ satisfies $\text{Dist}(U^+, U_r^\star) \leq \rho' \sigma_r(U_r^\star)$.

The proof follows the same steps as that of Theorem 4.2, except that Lemmas I.2 and I.3 are
used.

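J Main lemmas for convergence proof without tail bound assumptions


J.1 Proof of Lemma I.2

Let $U^+ = U - \widehat{\eta}\, \nabla f(X) U$ and $X^+ = U^+ (U^+)^\top$. By smoothness of $f$, we get:

$$f(X) \geq f(X^+) - \langle \nabla f(X), X^+ - X \rangle - \tfrac{M}{2} \|X^+ - X\|_F^2 \overset{(i)}{\geq} f(X_r^\star) - \langle \nabla f(X), X^+ - X \rangle - \tfrac{M}{2} \|X^+ - X\|_F^2, \qquad (52)$$

where (i) follows from the hypothesis of the lemma and since $X^+$ is a feasible point ($X^+ \succeq 0$) for problem (1).

Finally, since $\text{rank}(X_r^\star) = r$, by the (m, r)-restricted strong convexity of $f$, we get

$$f(X_r^\star) \geq f(X) + \langle \nabla f(X), X_r^\star - X \rangle + \tfrac{m}{2} \|X_r^\star - X\|_F^2. \qquad (53)$$

Combining equations (52) and (53), we obtain:

$$\langle \nabla f(X), X - X_r^\star \rangle \geq \langle \nabla f(X), X - X^+ \rangle - \tfrac{M}{2} \|X^+ - X\|_F^2 + \tfrac{m}{2} \|X_r^\star - X\|_F^2, \qquad (54)$$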


instead of (25) in the proof where $\eta$ is used. The rest of the proof follows the same steps as that of Lemma
6.2 and we get:

(V/P C),X - X*) > ^|| Xf(X)U\\l + f \\x* - xf F . 

Moreover, for the case where / is just M -smooth and A'* = X*, the above bound becomes: 

(V/(A),A-A r *)>^||V/(A)t/||) 

This completes the proof. 


If- 


J.2 Proof of Lemma I.3

Similar to Lemma 6.3, we have:


\QuQuXf(X)h ■ ||A||| = ff 16M||A|| 2 ||g £/ QjV/(A)|| 2 ||A|||+16||g £/ Q/V/(A)||^ ■ 


At this point, we desire to introduce strong convexity parameter m and condition number k in our bound. In 
particular, to bound term A, we observe that \\QuQyX f(X)\\ 2 < or WQuQjj^f{X)\\ 2 > ToTiu*) • This 

results into bounding A as follows: 


M\\X\\ 2 \\QuQuXf(X)\\ 2 \\A\\ 2 F 


< max 


16 • T] ■ M 11 X 11 2 ' m (T r ( X ) 


40t(C7* ) 


||A|||, rj- 16 • mT{U;)KT{X)\\QuQl T Xf{X)\\l • ||A||| 


} 


16-rj-M\\X\\ 2 -mcT r (X) 
— 40 t(U*) 


IIA ||+ rj ■ 16 • A0kt{X)t{U;)\\QuQJ,X f{X)\\l • ||A|||. 


Combining the above inequalities, we obtain: 

^ ~-y(X) 


Q^V/(A)|| 2 ||A|C < ■ IIA||f + (40kt(X)t(U*) + 1) ■ 16 • vWQuQb^f{X)\\l ■ ||A||^ 

^ ■ ll A ll f + (41 kt(X;)t(U;) + 1) • 16 • fj\\QuQZVf(X)\\l ■ (p')VPC) 


— 40r(t/*) 

(Hi) . , 

^ ma r (X) 

— 40 t(U*) 

r(X) 


I A|| | 


16 • 42 • rj ■ KT{X*)r{U* r ) ■ \\Xf(X)U\\ 2 F • ^ 


< • IIAIIf + 25^*) • l|V/P0P||£. 


(55)

where (i) follows from $\widehat{\eta} \leq \frac{1}{16 M \|X\|_2}$, (ii) is due to Lemma A.3 and bounding $\|\Delta\|_F \leq \rho' \sigma_r(U_r^\star)$ by the hypothesis
of the lemma, (iii) is due to $\sigma_r(X^\star) \leq 1.1\, \sigma_r(X)$ by Lemma A.3, $\sigma_r(X) \|Q_U Q_U^\top \nabla f(X)\|_2^2 \leq \|U^\top \nabla f(X)\|_F^2$ and
$(41 \kappa \tau(X^\star) + 1) \leq 42 \kappa \tau(X^\star)$. Finally, (iv) follows from substituting $\rho'$ and using Lemma A.3.

From Lemma 6.3, we also have the following bound:


\\Qu*RQZ*R^f( x )h < 


102 • 101t(?7*) 
99 • 100 1 


QuQjjVf(X) || 2 . 


(56) 


This follows from equation (34). Then, the proof completes when we combine the above two inequalities to 
obtain: 


<V/(X), aa t ) > - (§\\vf(x)u\\ 2 F + . |j A j||) 
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